
VALIDATION AND OPTIMIZATION OF

3D-HUMAN BODY POSE ESTIMATION


APPROACHES FOR USE IN MOTION ANALYSIS

Abdul Sami Noorzad¹ and Malek Zedan²

Hochschule Ruhr West (University of Applied Sciences)

Department of Computer Science

Lützowstraße 5

46236 Bottrop

Germany

Master Thesis

28 August 2019

1 abdulsami2014noorzad@gmail.com
2 m.zedan.en@gmail.com
TO MY FAMILY, MY TEACHERS, MY
FRIENDS AND ALL THOSE WHO
MADE A BETTER ME.

Abdul Sami Noorzad

Mom, despite the long distance between us,

I promise to live a life that will do justice to
all the sacrifices you've made. Also, for the
person who always stands beside me, "Heike
Reintanz": there are no words that can
express my thanks to you. If words could be
hugs, I would send you pages. Last but not
least, for my teachers, my friends and all
who supported me in my life: if the world
had more people like you, it would be a
better place.

Malek Zedan

Declaration

Our work in this thesis is based on research carried out at the Department of Computer Science
at Hochschule Ruhr West (University of Applied Sciences).
No part of this research has been submitted elsewhere for any other degree or qualification,
and it is all our own work unless referenced to the contrary in the text.
"The copyright of this research rests with the authors. No quotations from it should be published
without the authors' prior written consent, and information derived from it should be acknowledged."

Copyright © September 2019 by Abdul Sami Noorzad and Malek Zedan.


Feedback

Your feedback is welcome! We did our best to be as precise, informative and to the point
as possible, but should there be anything you feel might be an error, or that could be rephrased to be
more precise or comprehensible, please don't refrain from contacting us. Likewise, drop us a line
if you think there is something that might fit this thesis that you would like us to discuss. We will
make our best effort to update this document.
Contents

0.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

0.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1 Theoretical Background 20

1.1 Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

1.1.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

1.1.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

1.1.3 Deep learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

1.1.4 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

1.1.5 Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

1.1.6 Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

1.1.7 Activation Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

1.1.8 Weight Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

1.1.9 Cost Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

1.1.10 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

1.1.11 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

1.2 Multilayer Neural Networks (MLNN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

1.2.1 Feed forward multi-layer neural network (FFMLNN) . . . . . . . . . . . . . . . . 31

2 Convolutional Neural Networks 32

2.1 CNNs Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.1.1 Reason Behind Using CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.2 Applications of CNNs in HBPE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.2.1 Heat Map Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.2.2 Deep Pose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.2.3 Pose regression CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.2.4 CNNs for binary classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

2.2.5 Recurrent Neural Network (RNN) . . . . . . . . . . . . . . . . . . . . . . . . . . 40

2.2.6 Seq2Seq Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3 Activation Functions 43

3.1 What is an Activation Function? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.1.1 Sigmoid Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.1.2 Tangent Hyperbolic (Tanh) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.1.3 Hard Hyperbolic Function (Hard Tanh) . . . . . . . . . . . . . . . . . . . . . . . 48

3.1.4 Rectifier Linear Unit Function (ReLU) . . . . . . . . . . . . . . . . . . . . . . . . 48

3.1.5 Leaky ReLU or Parametric Rectified Linear Unit (PReLU) . . . . . . . . . . . . 49

3.1.6 Randomized Leaky ReLU (RReLU) . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.1.7 S-Shaped ReLU (SReLU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.1.8 Softplus Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.1.9 Exponential Linear Unit (ELU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.1.10 Maxout Activation Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.1.11 SoftMax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.1.12 Softsign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4 Data-Sets 57

4.1 Features of Data-sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.1.1 Human 3.6M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.1.2 Human Eva . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.1.3 COCO Data-set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.2 Objective Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.2.1 Least Absolute Deviations (L1 ) Loss Function . . . . . . . . . . . . . . . . . . . . 60

4.2.2 Least Square Errors (L2 ) Loss Function . . . . . . . . . . . . . . . . . . . . . . . 60

5 Implementation 62

5.1 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

5.2 Used Libraries and Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.2.1 PyTorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.2.2 NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.2.3 H5py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.2.4 Matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.3 Additional supporting methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.3.1 Mask R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.3.2 Stacked hourglass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.3.3 Cascaded Pyramids Network (CPN) . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.4 Preparing the datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.4.1 Human3.6M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.4.2 HumanEva-I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.5 Implementation and visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.5.1 Supervised . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.5.2 Semi-Supervised . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.5.3 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.6 Challenges and faced problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

6 Evaluation 67

6.1 Reconstruction errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6.2 Supervised Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6.2.1 Reconstruction error on Human3.6M data-set . . . . . . . . . . . . . . . . . . . . 68

6.2.2 Reconstruction error on HumanEva-I data-set . . . . . . . . . . . . . . . . . . . . 70

7 Conclusion 75

7.1 Our Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

7.2 Workout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

7.3 Results and discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


7.4 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

Acknowledgement

The presented results were achieved within the framework of a Master Thesis conducted at the
University of Applied Sciences Ruhr West in Bottrop.
We want to thank our supervisor, Mr. Michael Schellenbach, for his constant encouragement
and support throughout this research and for his assistance in the implementation of the idea.
We would also like to send our warm thanks to our advisor, Prof. Dr. Uwe Handemann, for
the patience and motivation he showed us during our research. Sami would like to thank all the
team members of the Student Integration Program (SIP), especially Frau Dr. Juliane Rytz, Frau
Christiane Hinrichs, Frau Stephanie Gotza, Frau Prof. Sandra Meyer, Frau Sara Cramer, Frau
Nina Hilligloh and Herr Dr. Aloes Wollney, for their administrative support and their help during our
participation in the German language course. We both would also like to thank our families, our
parents, and especially our classmates, for the love and constant support they gave us over the
years. Last but not least, we would like to offer sincere thanks to our friends and all the people
around us, who always stood by our side on a personal and friendly level.

MALEK ZEDAN and ABDUL SAMI NOORZAD


List of Figures

1.1 Biological and artificial neural networks [81] . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.2 Multi-layer neural network [82]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

1.3 A basic feed forward multi-layer neural network [82]. . . . . . . . . . . . . . . . . . . . . 31

2.1 A rough diagram for a Convolutional Neural Network [84]. . . . . . . . . . . . . . . . . . 33

2.2 Architecture of a traditional convolutional neural network [85]. . . . . . . . . . . . . . . 37

2.3 Regularized multilayer RNN [85] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

2.4 seq2seq model [86] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.1 An artificial neural network and the function f is the activation part [84]. . . . . . . . . 44

3.2 Two examples of sigmoid function [87]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.3 ReLU activation function [87]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.4 Leaky ReLU Activation function [87] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.5 The Derivatives of the most common activation functions [90] . . . . . . . . . . . . . . . 55

3.6 Sigmoidal activation functions [88]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

6.1 Reconstruction error in Protocol 1 with Human36M . . . . . . . . . . . . . . . . . . . . 70

6.2 Reconstruction error in Protocol 2 with Human36M . . . . . . . . . . . . . . . . . . . . 71

6.3 Velocity error with Human36M . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.4 Reconstruction error in Protocol 2 with HumanEva-I . . . . . . . . . . . . . . . . . . . 72

6.5 Reconstruction error in Protocol 2 with training on the H36M data-set in semi-supervised mode . 74

6.6 Reconstruction error in Protocol 3 with training on the H36M data-set in semi-supervised mode . 74

List of Tables

3.1 Existing architectures and their activation functions . . . . . . . . . . . . . . . . . . . . 54

4.1 Famous Publicly available data-sets [4]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.2 This table summarizes the features of L1 and L2 functions [91]. . . . . . . . . . . . . . . 61

6.1 Protocol 1: Comparison of the reconstruction error (MPJPE) of [45] with our optimizations. 68

6.2 Protocol 2: Comparison of the reconstruction error after rigid alignment with the ground
truth (P-MPJPE) of [45] with our optimizations. . . . . . . . . . . . . . . . . . . . . 69

6.3 Velocity error over the 3D poses generated by a convolutional model that considers time
and a single frame baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

6.4 Error of HumanEva-I under protocol 2 for a single action (SA) and multi-action (MA)
models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

6.5 Evaluation of the semi-supervised training with the Human3.6M data-set under protocol 2. . 73

6.6 Evaluation of the semi-supervised training with the Human3.6M data-set under protocol 3. . 73
Preface
This work is a synthesis of well-established material from various subfields of Mathematics,
Artificial Intelligence, Machine Learning and Deep Learning.

Equations, mathematical derivations, and experimental results from published papers and
books are carefully referenced.

This work was submitted in partial fulfillment of the requirements for the degree of Master of
Science (Computer Science) at Ruhr West University of Applied Sciences.
Outline

This thesis is structured into seven chapters in addition to the abstract and introduction.

i. Chapter one introduces the theoretical background needed to appreciate the field of
deep learning, especially the topics closely related to our work.

ii. Chapter two reviews the main components of convolutional neural networks, which are
the main part of our research; we also discuss recurrent neural networks, because
they are indirectly involved in this research.

iii. Chapter three presents a detailed description of our proposed approaches and the ex-
perimental results, including different types of activation functions. This chapter also
describes the properties of activation functions and the reasons why we use
specific activation functions in our research.

iv. Chapter four deals with the data-sets; in this chapter we also introduce the
properties of loss functions.

v. Chapter five covers the implementation phase, summarizing our experimental
work and simulation to present tangible results for our thesis.

vi. Chapter six shows the results of our empirical work and the running environment for
each one.

vii. Chapter seven concludes our work on both the theoretical and practical sides, dis-
cusses the delivered results and the encountered challenges, and finally outlines some
future directions.
0.1 Abstract

Human Body Pose Estimation (HBPE) from images and videos is considered one of the most
active, fast-growing and vital research areas in the field of artificial intelligence, especially in the sub-
field of Computer Vision [10]. HBPE is ordinarily defined as estimating the configuration of a person's
body parts, or pose, in an image [1]. HBPE strives to locate a sparse set of points in a given model,
corresponding to the human body joints, to enable human-computer interaction.

With the tremendous improvement of artificial-intelligence-based applications, ranging
from animation to medical aid (diagnostics), together with high-quality cameras and the availability of more
computational power, many new methods for computer vision are being explored.

Like many other computer vision disciplines, HBPE has benefited from the advent and de-
velopment of deep learning and convolutional neural networks (CNNs). It has
been boosted to attain impressive results on challenging datasets [24]. A critical success factor for
building HBPE on deep learning networks is the technique of producing heat-maps that loosely
locate the joints spatially. A further factor behind the success of HBPE systems
is that very impressive results can be obtained from merely a single monocular red-green-blue
(RGB) image, making it suitable for most portable camera systems, such as those embedded in mobile
phones. However, the use of monocular RGB images has a significant drawback, as it incurs a loss
of information that seems crucial for full scene understanding.

In this thesis, we strive to validate the estimated 3D poses in a video with a fully
convolutional model over 2D keypoints and advance an optimization approach based on the Human3.6M
and HumanEva-I data-sets. Our working method is to study the traditional HBPE approaches
alongside the most prevalent mathematical functions (loss and activation functions). The
activation functions play a fundamental role in the main structure of CNNs; therefore, we will compare
the mathematical properties of different activation functions to find the best one for our validation.

Last but not least, we have conducted research on loss functions to find the most efficient one
for our purpose, and we updated the weights and biases regularly to decrease the value of the loss function
(back-propagation).

To validate our approaches empirically, the Python programming language, version 3.x, with some
specific libraries and dependencies was our implementation environment, which we will describe later
in the implementation chapter.

We must mention that 3D HBPE is a tough task, and its difficulty lies in the 6 degrees
of freedom of every human body joint. Therefore we have analyzed and studied the mathematical
background of 3D-CNNs, because these mathematical formulas support us in measuring the geometry,
position, and motion of the human body in space.
0.2 Introduction

Human Body Pose Estimation (HBPE) is one of the most fascinating research areas; it is still in its early
growth stages and remains a largely unfinished task, although plenty of research has already been done.
Pictorial structure models play a vital role in HBPE research
[13],[14]. These models estimate the human body pose by modeling the body
as a collection of parts arranged in a deformable way. A tangible step was the launch of the Microsoft
Kinect depth sensor camera in 2010, which made dramatic changes in the field of computer vision.
Kinect cameras opened a new door for HBPE researchers by providing depth data [15]. Recently, CNNs
have achieved great success in deep learning tasks. Hence, many researchers have applied CNNs to obtain the best
results in HBPE tasks.

Furthermore, image classification is the task of assigning an input image one label from a
fixed set of categories; this problem is one of the central tasks in computer vision and, despite its
simplicity, has a vast variety of practical applications such as object detection and segmentation [7].
Human body pose estimation (HBPE) in the field of computer vision is the task of estimating
the configuration of a human body given an image. This configuration can be defined using the
geometric location of a set of points, typically corresponding to the human joints (hands, arms, knees,
etc.) [6]. The approach of 3D pose estimation is nascent and still in progress.

The most efficient 3D pose estimators are currently built on two-step pose estima-
tion: they employ 2D keypoint detections and predict their depth, which delivers better
performance than other approaches that exploit intermediate supervision and then back-project
to the 2D keypoint input. Moreover, the semi-supervised training method has already taken
a prominent part in tackling the scarcity of labeled videos in the datasets.

In this thesis, we focus on a method that combines deep learning algorithms with semantic
reasoning techniques to deliver an experimental validation for our proposed optimizations, with a
comparison against the best results delivered so far.

Briefly, the main objectives of this thesis are to:

1. compare and analyze the performance of the most significant activation functions and loss func-
tions in fully convolutional models;

2. improve the 3D pose estimation capabilities by studying the impact of the most important
improvement techniques, such as the stacked hourglass network, Mask R-CNN, and the cascaded pyramid
network (CPN), and combine them in a better way to attain more reliable results;

3. finally, examine the main structure of CNNs and figure out the benefits
of increasing or decreasing the number of hidden layers in a CNN and its primary influence
on the accuracy of the 3D pose predictions.
Chapter 1

Theoretical Background

This chapter advances the fundamental concepts of Artificial Intelligence that are related to our re-
search.

1.1 Artificial Intelligence

Since the term Artificial Intelligence was coined by John McCarthy many decades ago, it has become one of the
most elusive topics in Computer Science.

Artificial Intelligence (AI) concerns how to train machines, especially computer
systems, to simulate human intelligence processes such as learning, reasoning, and self-correction in order
to be more cognitive.

Some works in the literature [6] define AI as a general science and
engineering discipline which deals with the idea of creating intelligent machines.

Recently this field has gained much public consciousness, especially after Google's AlphaGo
defeated the South Korean Go master in early 2016. The applications of AI
are wide-ranging, including human body pose estimation, speech recognition, expert systems and many
more.

1.1.1 Machine Learning

Machine learning is a field of study and a branch of artificial intelligence that uses statistical learning
methods and computer science principles to develop machine intelligence. It is used to perform significant
tasks like prediction and inference, and its models are sets of mathematical relationships between the
inputs and outputs of a given system [42].

In machine learning, the learning process is the process of estimating the model's parameters
so that the model can perform a specific task, giving machines the ability to learn without
being explicitly programmed [44].

1.1.2 Deep Learning

Deep learning, also called hierarchical learning or deep structured learning, is a branch of machine
learning that builds on the formerly known artificial neural networks (ANNs) and extends the multilayer
perceptron (MLP), likewise known as the feed-forward neural network (FFNN), to flexible configurations
and a potentially high number of hidden layers [6].

We can also consider deep learning as a class of machine learning algorithms that can work
with structured and unstructured data alike for feature extraction and transformation, so that every
succeeding layer employs the output of the prior layer as input, and that learn in a supervised (e.g.,
classification) or unsupervised (e.g., pattern analysis) manner [4]. Deep learning is more recently being
referred to as representation learning in some literature [43].

Deep learning structures such as deep neural networks, recurrent neural networks and deep
belief networks have been widely applied in fields including computer vision (the academic field which
strives to gain a high-level understanding of the low-level information given by raw pixels from digital
images), natural language processing, speech recognition, social network filtering, audio recognition,
machine translation, bio-informatics, drug design, medical image analysis, material inspection, real-
time human body pose estimation in 2-dimensional and 3-dimensional form, and board game programs,
where they have produced results comparable to and in some cases superior to human
experts [2], [3].

Importantly, the advances in deep learning massively
depend on software such as Python, C++, Matlab and on software libraries such as Theano, PyLearn,
PyTorch, DistBelief, Caffe, MXNet, Keras, Tensorflow and many more.

Venn diagram for AI, machine learning, deep learning and CNNs [5].

1.1.3 Deep learning Algorithms

As stated in [42], deep learning algorithms are multilevel representation-learning techniques that compose
simple non-linear modules, each transforming the representation at one level into a slightly more abstract
representation, so that very complex functions can be learned.

1.1.4 Artificial Neural Networks

Artificial neural networks (ANNs), also known as connectionist systems, are computing systems inspired
by the biological neural networks (BNNs) that constitute human brains, in which each neuron has the
job of communicating with one or many other neurons through weighted connections, comparable
to synapses in biological systems. We can also say that ANNs are a software implementation of the
neuronal structure of our brain [5]. Such systems progressively improve their ability to perform tasks by
considering examples, generally without any task-specific programming. For example, in image
recognition, they can be trained to identify images that contain cats by examining example
images that have been manually labeled as 'CAT' or 'NO CAT', then using the analytic outcomes to
identify cats in other images.

We can train ANNs in a supervised, unsupervised, or semi-supervised manner. In supervised

training, the network is trained by providing matched input and output data samples, so that the ANN
gives the desired output for a given input. An example of this method is email spam detection.

Unsupervised learning in an ANN is an attempt to get the ANN to understand the structure
of the provided input data on its own.

Comparing supervised and unsupervised methods, we find that supervised deep learning
algorithms are trained on data-sets whose data carries labels that guide the algorithm to
understand which features are essential to the given problem, whereas unsupervised deep learning
algorithms are trained on unlabeled data and must determine feature importance on their own, based
on inherent patterns in the provided data.

The semi-supervised method can conceptually be positioned halfway between the supervised and un-
supervised methods. Generally, a semi-supervised learning problem starts with a series of labeled data
points as well as some data points for which labels are not known. The goal of a semi-supervised model
is to classify some of the unlabeled data using the labeled information set [42]. Semi-supervised learning
is a form of supervised learning with some additional information. We will focus on the supervised and
semi-supervised learning methods.

A computer sees an image as a matrix of pixels, each of which has three components: red, green and blue [5].

Similar to organic neurons in a biological brain, an artificial neural network is based on a

collection of connected units called artificial neurons, and each of the connections between neurons
can forward a signal to another neuron. The receiving neuron can process the signal and then signal
the downstream neurons connected to it. These artificial neurons may have states, generally represented by
real numbers, typically between 0 and 1, and they may likewise have weights that change as learning
proceeds, which can increase or decrease the strength of the signal sent downstream [7].

The main goal of an ANN is to approximate a function ϕ that takes as input a sample x from
a data distribution, for example a set of images containing bodies in different poses, and maps it to its
corresponding label y, for example an indication of whether there is a body or not, or where the body joints
are.

We can also say that the goal of an ANN is to learn a function ŷ = ϕ(x) such that
ŷ = y.

When the labels ŷ correspond to categories, the function ϕ is known as a classifier, whereas
when ŷ corresponds to a vector or a tensor of real numbers, the function ϕ performs regression. The
most basic and common type of neural network is the multilayer perceptron, in which information is
forwarded in a single direction from the input to the output without any cycle or loop [6].

Figure 1.1: Biological and artificial neural networks [81]

1.1.5 Neuron

Neurons are the nodes of a neural network. A neuron receives inputs over incoming connections from other
neurons, and potentially from itself, weighted by factors w. Each neuron computes the sum of the received,
weighted signals, adds a bias term, passes the result through a non-linear activation function and finally
forwards the output through weighted connections to other neurons.
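As a minimal sketch (our own illustration, not code from the implementation chapter), the computation of a single neuron can be written in a few lines of Python with NumPy; the input, weight and bias values are illustrative:

```python
import numpy as np

def neuron(x, w, b, activation):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = np.dot(w, x) + b      # sum of weighted incoming signals plus bias term
    return activation(z)      # non-linear activation of the pre-activation z

relu = lambda z: np.maximum(0.0, z)   # the activation we mainly use later
x = np.array([0.5, -1.0, 2.0])        # incoming signals
w = np.array([0.1, 0.4, -0.2])        # connection weights
b = 0.05                              # bias term
print(neuron(x, w, b, relu))
```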

1.1.6 Layers

All neurons in an ANN are organized into layers. Typically an ANN contains an input layer, an
output layer and some hidden layers in between. Each layer receives its input from the previous layers and
passes its output as input to the subsequent layers.

The input layer consists of d nodes, where d is the dimensionality of the input
data-set; each node takes one component of an input vector x and forwards the processed output to the
subsequent layer. The hidden layers perform the computation within the neural network.

The number of neurons in a layer is referred to as the width, while the number of hidden layers is referred to
as the depth.

The computation of the incoming signals for a layer can be expressed in the form of a weight
matrix W obtained by concatenating the individual weight vectors w.

1.1.7 Activation Function

Activation functions compute a scaled output of the weighted sum of the inputs.

Generally, activation functions are selected to be non-linear; in principle, any other func-
tion could be used as an activation function, but practical experience has proven a few distinct functions
to be of value.

Activation functions play a very significant role in our research; therefore we study
them in depth in chapter three of this thesis.

1.1.8 Weight Initialization

The weights and biases are set to initial values at the beginning of the neural network construction. The
weights should not be uniformly set to zero, because then all neurons in the output layer produce the same
output, so the backpropagated error, and thus the gradient, is identical for all neurons in the network.
To break the symmetry between all the weights, they are usually initialized to small randomized values,
keeping them close to zero.
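A minimal sketch of this symmetry-breaking initialization (the layer sizes and the scale value are illustrative assumptions, not a prescription from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, scale=0.01):
    # Small random weights break the symmetry between neurons;
    # biases may safely start at zero.
    W = rng.normal(0.0, scale, size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b

W, b = init_layer(n_in=3, n_out=4)
print(W.shape, b.shape)   # (4, 3) (4,)
```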

1.1.9 Cost Function

A cost function is a measure of how wrong the model is in terms of its ability to capture the relationship
between the input x and the output y.

In every ANN, for an input x a prediction ŷ is computed at the output layer and evaluated
against the ground truth y using a cost function.

Sometimes cost functions are expressed as the sum of a loss function over the given training set
plus some model complexity penalty.

The cost function is calculated as:

E(W, b) = \frac{1}{N} \sum_{n=1}^{N} E(W, b, x_n, y_n)^2    (1.1)

so that N is the number of training examples.

In the following, we mention some of the standard cost functions, which are relatively related
to our research.

i. Sum of squared errors (SSE):

SSE(y, \hat{y}) = \sum_{n=1}^{N} (y_n - \hat{y}_n)^2    (1.2)

ii. Root mean squared error (RMSE):

RMSE(y, \hat{y}) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2}    (1.3)

iii. Categorical Cross Entropy (CCE):

CCE(y, \hat{y}) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} y_n^{(k)} \log(\hat{y}_n^{(k)})    (1.4)

so that K is the number of classes. If K = 2, then for the case of binary CCE we get:

CCE(y, \hat{y}) = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log(\hat{y}_n) + (1 - y_n) \log(1 - \hat{y}_n) \right]    (1.5)

The CCE function has excellent properties, and these properties make it a suitable choice for
use as an error function when we train a deep neural network. A short sketch of these cost functions follows.
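As a hedged illustration (our own code, with vector inputs assumed; for CCE, `y` and `y_hat` are assumed to be one-hot labels and predicted class probabilities):

```python
import numpy as np

def sse(y, y_hat):
    return np.sum((y - y_hat) ** 2)               # equation (1.2)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))     # equation (1.3)

def cce(y, y_hat, eps=1e-12):
    # y: one-hot targets of shape (N, K); y_hat: predicted probabilities.
    return -np.mean(np.sum(y * np.log(y_hat + eps), axis=1))   # equation (1.4)
```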

1.1.10 Gradient Descent

The overall concept of gradient descent is to follow the gradients on the surface of the cost function for a
weight configuration of an ANN, iteratively correcting the weights in the direction of the negative
gradient slope and thereby minimizing the cost function E. This is possible for ANNs because they are constructed
of differentiable components, which also holds for commonly used cost functions [40].

At the beginning of the network construction, the weights w and biases b are initialized ran-
domly; then the derivatives of the cost function E with respect to all of the weights and biases are
computed, and the parameters are updated using stochastic gradient descent, stepping toward the negative gradient.

The above information about gradient descent is enough for our research purpose; interested
readers can find more details in [41].
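A minimal sketch of one gradient descent update (a hand-written rule for illustration; in practice we rely on PyTorch's built-in optimizers, as described in the implementation chapter):

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    # Move each parameter a small step against its gradient.
    return [p - lr * g for p, g in zip(params, grads)]

# Toy example: minimize E(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w = np.array([0.0])
for _ in range(200):
    grad = 2.0 * (w - 3.0)
    (w,) = sgd_step([w], [grad])
print(w)  # close to 3.0
```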

1.1.11 Deep Neural Networks

Deep neural networks (DNNs) are artificial neural networks (ANNs) with a certain level of complexity;
they use sophisticated mathematical modeling to process data in complex ways, and they must have
more than two layers.

Deep neural networks have been successfully used in diverse emerging domains to solve
complex real-world problems, with many more deep learning architectures being developed to date [3].
Deep neural networks use activation functions to perform various computations between the hidden
layers and the output layer of any given deep learning architecture. We will explain activation functions
in chapter three of this research.

1.2 Multilayer Neural Networks (MLNN)

The purest form of deep artificial neural network (DANN) is the feed-forward multilayer neural network
(FFMLNN). Like the biological neurons of the human brain, artificial neurons are the elementary building
blocks of artificial neural networks: they receive information as input,
process it with the help of a function and transmit it as output signals [6]. An artificial neuron can receive
one or more input units at a time, and the inputs are usually weighted (w) with real numbers
to express the importance of the corresponding data to the output. It is also necessary to add
a constant value to the output; this constant value is called the bias (b). In biological terms,
the bias can be considered a measure of how easy it is to get a neuron to fire [7].

Generally, both in biological neurons and in an artificial neuron, when the weighted input is received,
the neuron performs the following operations:

i. Summation of the weighted inputs received by the neuron, i.e.:

\sum_{i=1}^{n} w_i x_i = W \cdot X

so that W \cdot X is the dot product of the input vector and the weight vector, and n is the number of
inputs.

ii. Addition of the bias (b) to the output result, i.e.:

\sum_{i=1}^{n} w_i x_i + b = W \cdot X + b

Figure 1.2: Multi-layer neural network [82].

Figure 1.3: A basic feed forward multi-layer neural network [82].

iii. Application of a nonlinear activation function (in our research we used the ReLU activation
function), i.e.:

ReLU\left(\sum_{i=1}^{n} w_i x_i + b\right) = ReLU(W \cdot X + b)

1.2.1 Feed forward multi-layer neural network (FFMLNN)

An FFMLNN is a collection of neurons organized in layers and connected as a finite directed
acyclic graph [17], [18], so that the neurons belonging to one layer serve as input features for the neurons in the
next layer [18]. In each hidden layer, a non-linear transformation of the input from the
previous layer is computed. Therefore, the more hidden layers an ANN has, the higher
its ability to learn more complex functions; but we must not forget that more hidden layers make
the training procedure considerably more difficult, and more layers need more computing (CPU and GPU) power. We
must also remember that engineers fix the number of layers in some algorithms, so we can only choose
the most suitable one. A minimal sketch of such a network follows.
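Since PyTorch is our implementation environment (see chapter five), a minimal sketch of an FFMLNN with two hidden layers, applying exactly the three operations listed above (weighted sum, bias, ReLU), might look as follows; the layer sizes are illustrative assumptions only:

```python
import torch
import torch.nn as nn

# Each nn.Linear computes W·X + b; nn.ReLU applies the non-linearity.
model = nn.Sequential(
    nn.Linear(3, 8),   # input layer -> hidden layer 1
    nn.ReLU(),
    nn.Linear(8, 8),   # hidden layer 1 -> hidden layer 2
    nn.ReLU(),
    nn.Linear(8, 4),   # hidden layer 2 -> output layer
)

x = torch.randn(1, 3)   # a single 3-dimensional input
y = model(x)            # one feed-forward pass, no cycles or loops
print(y.shape)          # torch.Size([1, 4])
```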

Chapter 2

Convolutional Neural Networks

In this chapter, we will investigate some of the main building blocks of convolutional neural networks,
their mathematical behavior, and how far they are necessary for our research.

2.1 CNNs Approach

Convolutional Neural Networks (CNNs) are multistage architectures inspired by how the human brain
processes visual information; we can consider CNNs as the biologically inspired version of multilayer
perceptrons (MLPs) for vision tasks [11]. CNNs are a specialized kind of artificial neural network for
processing data that has a known grid-like topology [6].

Researchers have built many different architectures for CNNs. However, the majority of them can
be built by stacking three main types of layers in different combinations: the
convolution layer, the pooling layer, and the fully connected layer. We explain each of these layers
as follows:

i. The Convolution Layer: This layer aims to learn a feature representation of the input, and it
is composed of several feature maps. In a feature map, each neuron is connected to a region of neurons in the
previous layer, which is referred to as the neuron's receptive field in the previous layer [6]. The forward
pass through the convolutional layer computes feature maps from the input.

Figure 2.1: A rough diagram for a Convolutional Neural Network [84].

To compute the feature maps, we convolve the input feature maps with learned filters, and then we apply the activation function to these
outputs. Generally, the filters are spatially small, but they have the same depth as the input. On the
first layer of a CNN, a filter for depth images may have dimensions 9 × 9 × 1 (9 pixels in width, 9 pixels in
height, and 1 pixel in depth, because a depth image has one channel).

We compute the feature maps by sliding the filters over the input data to produce an activation
map that gives the responses of the filter at every spatial position.

ii. Pooling Layer: Pooling is an essential concept in CNNs. Pooling lowers the computational
burden by reducing the number of connections between convolutional layers, and pooling layers aim to
achieve spatial invariance by reducing the resolution of the feature maps. Typically, pooling layers are
placed between two convolutional layers. A pooling layer has the same number of
feature maps as the preceding convolutional layer, because each of its feature maps is connected to the
corresponding feature map of that convolutional layer.

The most commonly used pooling operations are average pooling [42] and max-pooling [6]; in our research,
we only use max-pooling, because it checks whether a given feature exists in the input and, if it
exists, returns its position.

iii. Fully Connected Layers (FCL): These layers are responsible for the higher-level reasoning in
CNNs, and they are usually used as the last layers of a CNN. A fully connected layer takes all nodes in the
previous layer and connects them to every single neuron that it has. Generally, the output of the last fully
connected layer is fed to the loss layer. A small sketch combining these three layer types follows.
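As a minimal sketch of the three layer types stacked together (channel counts, kernel sizes and the image size are illustrative assumptions, not the architectures we evaluate):

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_joints=17):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=9)  # i. convolution layer (9x9x1 filters)
        self.act = nn.ReLU()                        # activation applied to the feature maps
        self.pool = nn.MaxPool2d(2)                 # ii. max-pooling layer
        self.fc = nn.Linear(8 * 28 * 28, n_joints)  # iii. fully connected layer

    def forward(self, x):
        x = self.pool(self.act(self.conv(x)))       # 64x64 -> 56x56 -> 28x28
        x = x.flatten(1)                             # flatten feature maps for the FC layer
        return self.fc(x)

x = torch.randn(1, 1, 64, 64)   # one single-channel (depth-like) image
print(TinyCNN()(x).shape)       # torch.Size([1, 17])
```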

CNNs use convolution in place of general matrix multiplication in at least one of their layers.
The convolution operation is a specialized kind of linear operation; generally, in mathematics, we
define the convolution of two piecewise functions x and w as follows:

S(t) = x(t) * w(t) = \int_{0}^{t} x(a)\, w(t - a)\, da    (2.1)

so that the convolution operation is typically denoted with an asterisk (*); also, w must be a valid
probability density function, and w needs to be 0 for all negative arguments.

Equation (2.1) can also be represented in discrete form as:

S(t) = x(t) * w(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)    (2.2)

In deep learning applications, the input is usually a multidimensional array of data and the kernel
(the function w) is usually a multidimensional array (tensor) of parameters that are adapted by the learning
algorithm.

We can use convolution over more than one axis at a time. For example, if we use a two-
dimensional image I as our input, we probably also want to use a two-dimensional kernel K; in that
case we get the following result:

S(i, j) = I(i, j) * K(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n)    (2.3)

Fortunately, the convolution operation is commutative, i.e.:

S(i, j) = K(i, j) * I(i, j) = \sum_{m} \sum_{n} K(m, n)\, I(i - m, j - n)    (2.4)

Equation (2.4) is more straightforward to implement in deep learning libraries, because there
is less variation in the range of valid values of m and n.

We can instead implement a related function called cross-correlation in neural networks, which is the
same as convolution but without flipping the kernel, i.e.:

S(i, j) = I(i, j) * K(i, j) = \sum_{m} \sum_{n} K(m, n)\, I(i + m, j + n)    (2.5)

CNNs have been tremendously successful in practical applications, and they are widely used in different
areas of deep learning, especially 2D and 3D human pose estimation; in our work we will use both
operations and call them both convolution. A sketch of equation (2.5) follows.
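A minimal NumPy sketch of the "valid" cross-correlation of equation (2.5), which is what most deep learning libraries actually compute under the name convolution (our own illustration; the image and kernel values are arbitrary):

```python
import numpy as np

def cross_correlate_2d(I, K):
    """'Valid' 2D cross-correlation, equation (2.5): no kernel flipping."""
    H, W = I.shape
    h, w = K.shape
    S = np.zeros((H - h + 1, W - w + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            # S(i,j) = sum_m sum_n K(m,n) I(i+m, j+n)
            S[i, j] = np.sum(I[i:i + h, j:j + w] * K)
    return S

I = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate_2d(I, K))   # flipping K first would give true convolution
```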

If we assume that we have a 4-dimensional kernel tensor K, with element K_{i,j,k,l} giving the connection strength
between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows
and l columns between the output unit and the input unit, and further assume that the
input consists of observed data V, with element V_{i,j,k} giving the value of the input unit within channel i at row j and
column k, then, if our output Z is produced by convolving K across V without flipping K,
we get:

Z_{i,j,k} = \sum_{l,m,n} V_{l,\, j+m-1,\, k+n-1}\, K_{i,l,m,n}    (2.6)

where the summation over l, m, n is over all values for which the tensor indexing operations inside the sum-
mation are valid. Since we are using the Python programming language throughout our research,
the (-1) terms in equation (2.6) are necessary because in linear algebra notation we index into arrays
using 1 for the first entry, whereas in Python we index into arrays starting from 0.

For example, if we want to sample only every s pixels in each direction in the output, then, as
defined in [6], we can write the down-sampled convolution function c as follows:

Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} V_{l,\,(j-1)\times s+m,\,(k-1)\times s+n}\, K_{i,l,m,n}    (2.7)

so that s is referred to as the stride of the mentioned down-sampled convolution.

We can also extend equation (2.6) to a 6-dimensional weight tensor as follows:

Z_{i,j,k} = \sum_{l,m,n} V_{l,\,j+m-1,\,k+n-1}\, W_{i,j,k,l,m,n}    (2.8)

The above equation is sometimes referred to as unshared convolution, because it is almost the
same operation as discrete convolution with a small kernel (W or K), but without sharing parameters
across locations.

The next important operation in our research is tiled convolution. To define tiled convolution,
we consider K to be a 6-dimensional tensor in which two of the dimensions are related to different
locations in the output map. Considering t as the output width, we can write:

Z_{i,j,k} = \sum_{l,m,n} V_{l,\,j+m-1,\,k+n-1}\, K_{i,l,m,n,\,j\%t+1,\,k\%t+1}    (2.9)

where % is the modulo operation, with the following properties:

t\%t = 0

(t+1)\%t = 1

Now, to train a CNN that incorporates strided convolution of a kernel stack K applied to a multi-channel
image V with stride s, as defined by c(K, V, s) in equation (2.7), and supposing that we would
like to minimize some loss function J(V, K), then during forward propagation we will need to use c itself to compute the output Z,
which is then propagated through the rest of the network
and used to compute the cost function J(V, K).

During back propagation (updating weights and biases to decrease the loss function), we will receive
a tensor G so that:

G_{i,j,k} = \frac{\partial}{\partial Z_{i,j,k}} J(V, K)    (2.10)

To train the network, we can use the following function, which gives the derivatives with respect to the kernel weights:

g(G, V, s)_{i,j,k,l} = \frac{\partial}{\partial K_{i,j,k,l}} J(V, K) = \sum_{m,n} G_{i,m,n}\, V_{j,\,(m-1)\cdot s+k,\,(n-1)\cdot s+l}    (2.11)

If this is not the bottom layer of the network, then we must compute the gradient with respect to
V in order to back-propagate the error further down; for this purpose we can use the function h,
which is defined as follows:

h(K, G, s)_{i,j,k} = \frac{\partial}{\partial V_{i,j,k}} J(V, K) = \sum_{l,m} \sum_{n,p} \sum_{q} K_{q,i,m,p}\, G_{q,l,n}    (2.12)

where the sums run over all l, m, n, p such that:

j = (l-1)\cdot s + m

k = (n-1)\cdot s + p

The most important feature of convolutional neural networks is that they can be used to output high-
dimensional structured objects, rather than only predicting a class label for classification tasks or a
real value for regression tasks; typically these output objects are tensors emitted by a standard
convolutional layer.

As an example, the model might emit a tensor S such that S_{i,j,k} is the probability that pixel
(j, k) of the input to the network belongs to class i; this procedure allows the model to label every
pixel in an image and draw precise masks that follow the outlines of individual objects [42]. One
issue that often comes up is that the output plane can be smaller than the input plane. In order
to produce an output map of similar size to our input, we can avoid pooling altogether [43].

Figure 2.2: Architecture of a traditional convolutional neural network [85].

The architecture of the original convolutional neural network from LeCun et al. (1989)
alternates between convolutional layers (which already include non-linearities) and subsampling
layers; the feature maps of the final subsampling layer feed into a number of fully connected layers.

Another strategy is to emit a lower-resolution grid of labels, but this strategy is not relevant for our thesis;
the reader can find clear and well-explained information in [42]. Finally, in principle, we can use a
pooling operator with unit stride.

Besides the above-mentioned strategies, another good strategy for pixel-wise labeling of images
is to produce an initial guess of the image labels and then refine this initial guess using the interactions
between neighbouring pixels. Repeating this refinement step several times, using
the same convolution at each stage and sharing weights between the last layers of the deep convolutional
neural network [18], makes the sequence of computations performed by the successive
convolutional layers, with weights shared across the layers, a particular kind of recurrent neural
network [6].

2.1.1 Reason Behind Using CNNs

The reason why we use CNNs for the validation and optimization of 3D-HBPE is that MLNNs
are not suitable when dealing with visual information in general, and HBPE in particular, because the full
connectivity of the network leads to slow learning, as the number of weights rapidly increases with the
higher dimensionality of the human body pose as visual input. The second disadvantage of an MLNN is that every pair of
neurons between two layers has its own weight. Therefore, we need a new architecture for the structure
of neural networks to help us analyze the human body pose, and the most well-known such architecture is
the CNN (Mask R-CNN for our research).

CNNs are feed-forward supervised deep neural networks [36], and they are a particular type
of MLNN that has comparably fewer connections and parameters and is easier to train. We can apply
CNNs to arrays of data where nearby values are correlated, i.e. images, videos, sounds, time-frequency
representations, and RGB depth images. The most successful application of CNNs is 2D and 3D
HBPE in real time [45].

The types of CNNs that we are most interested in are Krizhevsky's architecture (AlexNet),
which has 7 layers [38], for theoretical reasons, and Mask R-CNN [28], which has 243 layers,
for our practical research. They both achieved outstanding results on a large benchmark data-set
consisting of about one million images (ImageNet) [39], and we will apply them to Human3.6M for
learning and to the HumanEva-I and HumanEva-II data-sets for testing.

Both of these CNNs use the ReLU activation function, because ReLU speeds up training,
which makes it feasible to experiment with such large neural networks.

Generally, we can train a CNN by using stochastic gradient descent (SGD) with the Softmax
function or any other function. We used SGD with the L2 loss function.
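A hedged sketch of this combination in PyTorch (the model and data here are illustrative stand-ins, not the training script of the implementation chapter): torch.optim.SGD provides the optimizer and nn.MSELoss the L2 loss:

```python
import torch
import torch.nn as nn

model = nn.Linear(34, 51)              # toy stand-in: 17 2D keypoints -> 17 3D joints
criterion = nn.MSELoss()               # L2 loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 34)                 # a batch of flattened 2D keypoints
target = torch.randn(8, 51)            # corresponding 3D ground truth

for step in range(100):
    optimizer.zero_grad()              # clear old gradients
    loss = criterion(model(x), target) # forward pass and L2 loss
    loss.backward()                    # back-propagation
    optimizer.step()                   # SGD update of weights and biases
```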

2.2 Applications of CNNs in HBPE

In this part, we briefly introduce some applications of CNNs in HBPE.

2.2.1 Heat Map Models

In 2014, a group of scientists from New York University, building on Krizhevsky's CNN architecture, pro-
posed a model that takes 3 levels of an RGB Gaussian pyramid as input and produces, as the final output, a heat map
for each body joint, describing the per-pixel likelihood of that joint occurring at each output spatial
location [35].

Similarly, a group of researchers from Google, after predicting the heat maps of all joint
locations, used these predictions to crop out a window centered at the predicted joint locations from
the first two convolutional feature maps of each resolution. They kept the contextual size of the windows
constant by scaling the cropped area at each higher resolution level, and these feature maps were then
propagated through a fine heat map model to produce an offset within the cropped sub-window.

Finally, the position refinement is added to the first predicted location, producing a final 2D
location for each joint.

The fine heat map model is a Siamese network of instances corresponding to the several joints,
so that the weights and biases of each module are shared. These convolutional sub-networks are applied
to each joint independently, because the sample location for each joint is different and the convolutional
features do not share the same spatial context.

The heat map model and the fine heat map model are trained jointly by minimizing a modified MSE
function between the predicted heat map and the target heat map, which is a two-dimensional Gaussian
of constant variance centered at the ground-truth joint location (x, y) (in our case (x, y, z)).
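A minimal sketch of such a Gaussian target heat map in NumPy (the grid size and variance are illustrative assumptions, not values from the cited papers):

```python
import numpy as np

def gaussian_heatmap(cx, cy, size=64, sigma=2.0):
    """Target heat map: a 2D Gaussian of constant variance centered at (cx, cy)."""
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

target = gaussian_heatmap(cx=20.0, cy=31.5)   # ground-truth joint at (20, 31.5)
print(target.shape, target.max())             # (64, 64), peak near 1.0 at the joint
```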

2.2.2 Deep Pose

Google researchers proposed this 2-dimensional pose estimation approach; they formulated 2D-HBPE
as a joint regression problem and showed how to cast it in a CNN setting. The full RGB input image is
passed through a 7-layer CNN to estimate the 2D location of each body joint. The predicted joint locations
are then refined by using higher-resolution sub-images as input to a cascade of CNN-based pose
predictors. This architecture is based on Krizhevsky's CNN, but the loss function is different:
instead of a classification loss, a linear regression is trained on top of the last CNN layer by minimizing
the Euclidean distance between the prediction and the true pose.

2.2.3 Pose regression CNN

A joint group of researchers from the University of Oxford and the
University of Leeds presented this CNN architecture; the goal of their work was to track the 2D upper human body pose over
long gesture videos. The architecture is very similar to DeepPose, except for the input layer, where
multiple frames are inserted into the data layer's color channels.

2.2.4 CNNs for binary classification

This architecture was designed by A. Jain in 2013, and it performs independent binary
body-part classification with one network per feature. The input to these networks must be 64 × 64
pixel RGB image patches, processed with a locally connected network. Besides this, the CNNs are
applied as sliding windows to overlapping regions of the input. A window of pixels is mapped to
a single binary output representing the probability of the body part being present in the patch. This
approach enables the use of much smaller CNNs and retains the benefits of pooling, at the expense
of maintaining a separate set of parameters for each body part.

Since a series of independent part detectors cannot enforce consistency in a pose in the same
way as a structured output model, which produces valid full-body configurations, a method enforcing pose consistency
using parent-child relationships is applied after training the CNN with standard batch stochastic gradient descent (SGD).
Besides the mentioned CNN architectures, there are many
other ones available, and all of these architectures together play a fundamental role in the field
of HBPE. We mention just the names of some of them:

LeNet, AlexNet, VGGNet, GoogleNet, ResNet, ZFNet, etc.

In our research, we focus only on AlexNet for theoretical reasons and on the Mask R-CNN archi-
tecture for our practical use.

2.2.5 Recurrent Neural Network (RNN)

The Recurrent Neural Network (RNN) is a neural sequence model that achieves state-of-the-art performance
on essential tasks including language modeling [6], speech recognition [18], and machine translation.
It is known that proper applications of neural networks require proper regularization. Unfortunately,
dropout, the most potent regularization method for feed-forward neural networks, does not work well
with recurrent neural networks. As a result, practical applications of RNNs often use models that
are too small, while big RNNs tend to over-fit; the existing regularization methods give relatively
small improvements for RNNs.

In this work, we sometimes only compare our model with RNN results; the RNNs themselves
are not part of our research.

Figure 2.3: Regularized multilayer RNN [85]


The dashed arrows show connections where dropout is applied, and the solid lines indicate
connections where dropout is not applied [18].

2.2.6 Seq2Seq Model

Seq2Seq model: This model was introduced by researchers at Google and revolutionized the process of
machine translation by making use of artificial intelligence; currently this model is used in different fields
such as image captioning, conversational models and text summarization. It takes as input a sequence
of words or sentences and generates an output sequence of words by use of an RNN. It should be
noted that the vanilla version of the RNN is rarely used, because RNNs suffer from the problem of
vanishing gradients [80].

The Seq2Seq model mainly has two components, an encoder and a decoder; therefore this
model is sometimes called an encoder-decoder network.

The encoder uses deep neural network layers and converts the input words to corresponding
hidden vectors.

The decoder is similar to the encoder; it takes as input the hidden vector generated
by the encoder, its own hidden states and the current word, produces the next hidden vector, and finally
predicts the next word.

Figure 2.4: seq2seq model [86]

As we mentioned, the Seq2Seq model works on the basis of RNNs, and in our research we are not
interested in RNNs as such; however, we sometimes need to compare our model with results generated
by RNNs, so this model is essential for us. A minimal encoder-decoder sketch follows.
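As a hedged illustration (our own toy code; the GRU cells, sizes and teacher-forcing setup are illustrative assumptions, not the models discussed above, which also add embeddings, attention and beam search):

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, vocab=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)        # predicts the next word

    def forward(self, src, tgt):
        _, h = self.encoder(self.embed(src))       # encoder: input words -> hidden vector h
        dec_out, _ = self.decoder(self.embed(tgt), h)  # decoder starts from h
        return self.out(dec_out)                   # logits over the vocabulary

src = torch.randint(0, 100, (1, 7))   # a source "sentence" of 7 token ids
tgt = torch.randint(0, 100, (1, 5))   # shifted target tokens (teacher forcing)
print(TinySeq2Seq()(src, tgt).shape)  # torch.Size([1, 5, 100])
```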

Chapter 3

Activation Functions

In this part, we present a brief overview of some of the existing activation functions used
in deep learning applications and highlight the recent trends in their use relevant to
our research.

This chapter will help us make effective decisions in the choice of the most suitable and
appropriate activation function for our research.

3.1 What is an Activation Function?

In deep learning, activation functions are non-linear functions that convert a node's input value to its
output value. We can also define activation functions as functions used in artificial neural networks
that take the weighted sum of inputs and biases and decide whether a neuron fires
or not [42].

Activation functions manipulate the presented data and are optimized through gradient processing,
usually gradient descent, after which they produce the output of the ANN that reflects the parameters
learned from the data; in some of the literature these activation functions are referred to as transfer functions.

Activation functions can sometimes be linear but are mostly non-linear, and they are used to control
the output of an ANN across different domains, from human body pose estimation, object recognition
and classification [47], [49], [60], to speech recognition [62], [63], segmentation [64], [65] and many more.

Figure 3.1: An artificial neural network; the function f is the activation part [84].

The most crucial aspect of activation functions is that a proper choice improves the results
of our CNN computations; therefore it is an essential task to select the best activation function for our
research purpose. As a result, we can say that activation functions are transfer functions that are
applied to the output of the linear models to produce a transformed non-linear output, ready for
further processing; since the raw output of an ANN layer is linear in nature, non-linear activation
functions are required to convert this linear input into a non-linear output. The position of an activation
function in a network structure depends on its role in the ANN: when the activation
function is placed after the hidden layers, it converts the learned linear mapping into non-linear form
for propagation, while in the output layer it performs predictions [67]. During the training of CNNs,
we need to compute the gradient of the output with respect to the weights and the input.

The most commonly used activation functions are of the logistic type; the most
important ones are Sigmoid, Tanh, ReLU, Leaky ReLU, and Softmax. In our research
we focus mainly on the properties of ReLU, Leaky ReLU, and Sigmoid.

3.1.1 Sigmoid Function

The sigmoid function, also called a logistic or squashing function [52], can mathematically be defined as:

sig(x) = 1 / (1 + e^{-x})

Figure 3.2: Two examples of sigmoid functions, f(x) = 1/(1 + e^{-5x}) and g(x) = 1/(1 + e^{-10x}) [87].

The sigmoid function has the following properties:

i. Large positive inputs converge to 1 and large negative inputs converge to 0.

ii. The function is strictly increasing.

iii. The sigmoid function squashes the input into the range between 0 and 1.

iv. The function exhibits a balance between non-linear and linear behavior.

v. Unfortunately, this function saturates and kills gradients; therefore we do not use it in
most CNNs, because when a neuron's output converges to either 0 or 1, the local gradient
in these regions becomes almost zero, which prevents the training of the higher-level layers.
The local gradient is crucial for back-propagation [46].

vi. Since the sigmoid function's output is not zero-centered, the neurons at higher layers of
the CNN will receive data that is not zero-centered, which is harmful to the performance
of the CNN. These layers will get data which is always positive, so the gradient on the
weights will be either all positive or all negative in the backward pass, which causes
jittering of the weight updates during training with small batch sizes.

As shown in Fig. 3.2, the sigmoid function moves from 0 to 1 as the input x grows beyond a
particular value. More important for the function is its smoothness: we can take its
derivative, and this is very important for the training algorithms.
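
As a small numerical illustration of the saturation behavior described in property v, the following sketch (the helper names are our own) evaluates the sigmoid and its derivative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative sig(x) * (1 - sig(x)); its maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

# For large |x| the local gradient is effectively zero ("saturation"):
print(sigmoid_grad(np.array([-10.0, 0.0, 10.0])))  # approx. [4.5e-05, 0.25, 4.5e-05]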

There exist three variants of the sigmoid function which are widely used in deep learning
applications; they are listed as follows:

Hard Sigmoid Function

The hard sigmoid function is mathematically defined as follows:

f(x) = max(0, min(1, (x + 1)/2))     (3.1)

One of the best properties of the hard sigmoid compared to the soft sigmoid is that it
incurs a lower computation cost when implemented in either specialized hardware or software form,
and it has shown promising results on deep learning-based binary classification tasks
[69].

Sigmoid Weighted Linear Units (SiLU)

The SiLU is an activation function proposed for reinforcement learning-based function approximation [70]; it can be defined as follows:

a_k(x) = y_k α(y_k)     (3.2)

so that

y_k = Σ_{i=1}^{n} w_{ik} x_i + b_k     (3.3)

where x is the input vector, y_k is the input to hidden unit k, b_k is the bias, α is the sigmoid
function, and w_{ik} is the weight connecting input i to hidden unit k.

The SiLU activation function is used in the hidden layers of deep neural networks and was
proposed for reinforcement learning-based systems. The authors of [70] report that the SiLU activation
function outperforms the ReLU function.

Derivative of Sigmoid-Weighted Linear Units (dSiLU )

The dSiLU activation function is the gradient of the SiLU function and is used for gradient descent
learning updates of the ANN weight parameters. Being the derivative of y_k α(y_k), the dSiLU function can be defined as follows:

a_k(x) = α(y_k)(1 + y_k(1 − α(y_k)))     (3.4)

The authors in [70] highlighted that the dSiLU outperformed the standard Sigmoid function
significantly.
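
A short numerical sketch of Eqs. 3.2 and 3.4 (the function names are our own), useful for checking the two units:

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def silu(y):
    # Eq. 3.2: y * sigmoid(y)
    return y * sigmoid(y)

def dsilu(y):
    # Eq. 3.4: the derivative of y * sigmoid(y)
    s = sigmoid(y)
    return s * (1.0 + y * (1.0 - s))

y = np.array([-2.0, 0.0, 2.0])
print(silu(y))   # approx. [-0.238, 0.0, 1.762]
print(dsilu(y))  # approx. [-0.091, 0.5, 1.091]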

3.1.2 Tangent Hyperbolic (Tanh)

The Tanh activation function is widely used in RNNs, and it gives better training performance for
MLNNs [79], [71], [72], [49].

The tanh function is defined as follows:


tanh(x) = sinh(x)/cosh(x) = (e^x − e^{-x}) / (e^x + e^{-x})

This function has the following mathematical properties:

i. The function squashes the input.

ii. The function is strictly increasing, with output between −1 and 1.

iii. The Tanh function's output is zero-centered, and because of this property it performs better
than the sigmoid function in practice.

iv. Unfortunately, similar to the sigmoid function, it also suffers from the output saturation problem.

3.1.3 Hard Hyperbolic Function (Hard Tanh)

This activation function is a cheaper and more computationally efficient version of Tanh, and
its output lies in the range of −1 to 1.

Mathematically, Hard Tanh can be defined as:

f(x) = −1,  if x < −1
       x,   if −1 ≤ x ≤ 1
       1,   if x > 1

3.1.4 Rectifier Linear Unit Function (ReLU)

ReLU is the most successful and widely used activation function [57]; it guarantees faster computation
since it computes neither exponentials nor divisions, so the overall speed of computation is enhanced [58].
Another property of the ReLU is that it introduces sparsity in the hidden units, as it maps all
negative values to zero. The ReLU function can be defined as follows [46]:

ReLU(x) = max(0, x)

ReLU activation function has the following properties [18]:

1. The output range of ReLU function is [0, ∞).

2. The best property of this function is, it does not vanish the gradient, and because of this property
ReLU function is widely used in CNNs.

3. It can significantly accelerate the convergence of stochastic gradient descent compared to sigmoid
and tanh functions.

4. Unlike sigmoid and tanh, ReLU function does not require any exponential computation. Therefore
it can be computed very efficiently.

5. This function is not differentiable at zero, and it is unbounded. The gradients for negative
inputs are zero, which means that for activations in that region the weights are not updated during
back-propagation. This can create dead neurons which never get activated.

Figure 3.3: ReLU activation function [87].

As mentioned in property 5 of the ReLU function, units can quickly become weak
or die during the training of our CNN. If a large gradient flows through a ReLU activation function, it
can cause the weights to update in such a way that the artificial neuron will never activate on any
data point again; if that happens, the gradient flowing through the unit will be zero from that point on.
In this way, ReLU units can irreversibly die during CNN training, since they can get knocked off the
data manifold.

If the learning rate is set too high, it is possible that many neurons are never activated
across the entire training data-set; with a proper setting of the learning rate this is less frequently an
issue. Mathematically, we get a dying ReLU when the derivative is precisely zero for all inputs a unit
receives.

To overcome this kind of problem, a group of scientists from Microsoft introduced the
parametric rectified linear unit (PReLU) in 2015, a generalization of the Leaky ReLU [6].

3.1.5 Leaky ReLU or Parametric Rectified Linear Unit (PReLU)

This activation function introduces a small negative slope into the ReLU to keep the
weight updates alive during the entire propagation process [71]. The parameter p was introduced as a
solution to the ReLU dead-neuron problem, so that the gradients will not be zero at any time during
training.

Leaky ReLU computes the gradient with a small constant value p for the negative part,
typically on the order of 0.01.

Figure 3.4: Leaky ReLU Activation function [87]

In the ReLU function the activation is zero for all values x < 0; a PReLU function instead
has a small negative slope p, so that:

PReLU(x) = x,   if x > 0
           px,  if x ≤ 0

The coefficient p can be set manually to a small value or learned adaptively. Some researchers
have reported success with this form of activation function, but the results are not always consistent
[6].

For interested readers, we recommend the use of this activation function with
p = 0.01.
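
In PyTorch, both variants are readily available; a minimal usage sketch with the slope p = 0.01 recommended above:

import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.0])

# Leaky ReLU with a fixed negative slope p = 0.01
leaky = nn.LeakyReLU(negative_slope=0.01)
print(leaky(x))  # tensor([-0.0200, -0.0050,  0.0000,  1.0000])

# PReLU treats the slope as a learnable parameter (here initialized to 0.01)
prelu = nn.PReLU(num_parameters=1, init=0.01)
print(prelu(x))  # same values initially, but the slope is updated by backprop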

In general, with i indexing the activation channels, the Leaky ReLU activation function is defined
as:

f(x_i) = x_i,      if x_i > 0
         p_i x_i,  if x_i ≤ 0

where p_i is the parameter controlling the negative slope; it is learnable during training with
backpropagation. If p_i = 0, then PReLU becomes ReLU.

For simplicity we can also write the PReLU as:

f (xi ) = max(0, xi ) + pi min(0, xi ) (3.5)

In [21], the authors showed that PReLU performs better than ReLU in large-scale
image recognition, and because of this property we are interested in this activation function throughout
our research.

3.1.6 Randomized Leaky ReLU (RReLU)

This function is defined as:

f(x_{ji}) = x_{ji},         if x_{ji} ≥ 0
            p_{ji} x_{ji},  if x_{ji} < 0

where p_{ji} ~ U(l, u), with l < u and l, u ∈ [0, 1]. RReLU is a dynamic variant of Leaky ReLU,
where a random number sampled from a uniform distribution U(l, u) is used to train the neural network
[74].

3.1.7 S-Shaped ReLU (SReLU)

This function is defined as:

f(x_i) = t_i^r + p_i^r (x_i − t_i^r),  if x_i ≥ t_i^r
         x_i,                          if t_i^l < x_i < t_i^r
         t_i^l + p_i^l (x_i − t_i^l),  if x_i ≤ t_i^l

where t_i^l, t_i^r, p_i^l, and p_i^r are learnable parameters of the network, and the index i indicates that the
function can vary across channels [75]. Here p_i^r is the slope of the right-hand line for inputs above
the threshold t_i^r, and t_i^l and t_i^r are the thresholds in the negative and positive directions, respectively.

The authors in [75] highlighted that SReLU was tested with well-known CNN architectures such as
GoogLeNet on data-sets such as CIFAR-10, ImageNet, and MNIST, and it showed improved results. Therefore,
we worked with this activation function and studied it in our theoretical research.

3.1.8 Softplus Function

This activation function is defined as [76]:

f(x) = log(1 + e^x)     (3.6)

3.1.9 Exponential Linear Unit (ELU)

This function is defined as [77]:

f(x) = x,            if x > 0
       p(e^x − 1),   if x ≤ 0

where p is the ELU parameter controlling the saturation point for negative net inputs; it
is usually set to 1.0.

3.1.10 Maxout Activation Function

This function is given as [78]:

f(x) = max(w_1^T x + b_1, w_2^T x + b_2)     (3.7)

where w denotes the weights, b the biases, and T the matrix transpose operation.

3.1.11 SoftMax

This activation function can be defined as:

f(x_i) = e^{x_i} / Σ_{j=1}^{n} e^{x_j}     (3.8)

Softmax is used to compute probability distribution from a vector of real numbers.

This function produces outputs in the range between 0 and 1, with the probabilities summing
to 1.

We can use Softmax in multi-class models, where it returns the probability of each class, with
the target class having the highest probability.

Softmax appears in almost all output layers of deep learning architectures [47],
[64], [72]. Comparing Softmax with Sigmoid, the Sigmoid is used for binary
classification while the Softmax is used for multi-class classification tasks.
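
A numerically stable sketch of Eq. 3.8 (subtracting max(x) does not change the result but avoids overflow in the exponential; the function name is our own):

import numpy as np

def softmax(x):
    z = x - np.max(x)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # probabilities summing to 1, largest for the largest input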

3.1.12 Softsign

This activation function is defined as follows [52]:

f(x) = x / (1 + |x|)     (3.9)

If we compare the Softsign function with the Tanh activation function, we see that Softsign
converges polynomially, unlike Tanh, which converges exponentially; this property can be an
advantage at run time.

Softsign has been widely used in regression problems and has also been used in
deep learning-based text-to-speech systems, which are outside our research interest.

In conclusion, we would like to mention that besides the introduced activation functions there are
many others, but for our research it was fully sufficient to use ReLU,
Leaky ReLU, and Sigmoid.

We now collect the most relevant results of this chapter in the following table, listing state-of-the-art
deep neural network architectures that emerged as significant improvements over the existing
ones, together with the activation functions used in their hidden and output layers.

Deep Learning Architectures Hidden Layers Output Layers


SeNet ReLU Sigmoid
MobileNets ReLU Softmax
ResNext ReLU Softmax
ResNet ReLU Softmax
SqueezeNet ReLU Softmax
GoogleNet ReLU Softmax
SegNet ReLU Softmax
VGGNet ReLU Softmax
ZFNet ReLU Softmax
NiN No Activation Function Softmax
AlexNet ReLU Softmax
Mask-R-CNN ReLU Softmax

Table 3.1: Existing architectures and their activation functions

As we can see from the above table, most of the existing architectures use the ReLU activation
function; we are going to use Leaky ReLU with Mask R-CNN, hoping to get a better result.
We will learn more about the results of our experiment in chapter five.

The following figure shows the derivatives of some well-known activation functions. The
derivative of ReLU is also included in this figure, and we will learn more about it in chapter five.

Figure 3.5: The Derivatives of the most common activation functions [90]

The following figures show the graphs of some essential activation functions. We will also
use the sigmoid activation function in our research and will outline the results of using it
in chapter five.

Figure 3.6: Sigmoidal activation functions [88]: (a) logistic sigmoid σ(z), (b) hyperbolic tangent
tanh(z), (c) softsign s(z), (d) rectified hyperbolic tangent |tanh(z)|.


Commonly used activation functions include the logistic sigmoid σ(z) and the hyperbolic tangent
tanh(z). More recently used activation functions are the soft sign and the rectified hyperbolic tangent.

Chapter 4

Data-Sets

This chapter gives an overview of the data-sets on which we focused throughout our research; the
mathematical properties of the loss functions will be discussed as well.

4.1 Features of Data-sets

In order to train a convolutional neural network (CNN) for the HBPE task, we need large data-sets with
annotated ground-truth information; each data-set must have the following features and requirements:

(a) The data must be in video format, with a single person performing different poses.
(b) Each video frame should be annotated with ground-truth full-body joint positions in the
3D camera coordinate system.
(c) The resolution of the images should be large enough to obtain a proper bounding box of
the human body.
(d) The data-set should be downloadable and available for research purposes.
(e) The data-set should contain a large number of different poses taken from different persons.
(f) Ideally, the actions and poses should be highly varied.
(g) The data-set must contain a large number of complete video sequences.
(h) It should also contain a large number of different camera views.
(i) The data-set should allow the user to obtain bounding boxes of the human
body quickly.

The point to be noted is that selecting a data-set for training CNN models that meets our
research requirements is not an easy task; the choice of training data is crucial and had to be well
thought out beforehand.

Having selected a data-set, it is essential to analyse, arrange, extract, and preprocess the selected
data-set format for our CNN training task, and of course this process takes much time. However, it
is better to invest this time and select the best data-set; otherwise it would be very undesirable
to change the data-set selection later. We list some of the most well-known data-sets in the
following table:

Data-set Year Resolution Camera Subjects Sequences Scenarios Joints


Human3.6M 2014 1000x1000 4 10 1200 15 32
Berkeley MHAD 2013 640x480 2 12 660 11 43
HumanEva 2007 640x480 3 4 56 6 15
MP 108 2010 1004x1004 8 4 54 4 22
INRIA RGBD 2015 640x480 1 1 12 - 15
CMU-MMAC 2009 1024x768 3 43 645 5 22
Cornell Activity 2009 320x240 1 8 180 22 11
Microsoft COCO 2015 640x480 - - - - -

Table 4.1: Famous Publicly available data-sets [4].

Considering table 4.1, we explain only the data-sets which are used in our research.

4.1.1 Human 3.6M

Human3.6M contains 3.6 million 3D human body poses and corresponding images and is to date the
largest publicly available motion capture data-set. This data-set consists of high-resolution 50 Hz video
sequences of 1000 × 1000 pixels taken from 4 calibrated cameras, capturing 11 professional
actors (six male and five female) performing 17 different actions, such as discussion, smoking, taking
photos, talking on the phone, etc. The 3D ground-truth joint locations are provided in the camera
coordinate system. Besides this, the data-set provides the bounding boxes of the human bodies. The
ground truth for three subjects is withheld and used for evaluation of results on a server.
These advantages led us to select this data-set for our research.

4.1.2 Human Eva

The HumanEva data-set contains training data of 56 color video sequences of 640 × 480 resolution
capturing four subjects performing six predefined actions (walking, jogging, gesturing, etc.) in 3
repetitions. In addition, this data-set contains seven calibrated video sequences (4 grayscale and 3
color) that are synchronized with 3D body poses obtained from a motion capture system. There are
almost 14,000 synchronized video frames with ground-truth 3D joint locations available for training and
validation, and approximately 30,000 for testing.

Regarding the properties mentioned earlier, this data-set is too small for training a CNN;
however, it is appealing for testing, as there are groups of researchers reporting 3D pose estimation
results on it.

During the practical part of our research, we have continuously used this data-set, and its
content plays an essential role in our research project.

4.1.3 COCO Data-set

The Microsoft COCO data-set is a large-scale object detection, segmentation, and captioning data-set
with several features: object segmentation, recognition in context, superpixel stuff
segmentation, 330,000 images, more than 200,000 labeled images, 1,500,000 object instances, 80 object
categories, 91 stuff categories, five captions per image, 250,000 people with keypoints, 80,000 training
images, and 40,000 validation images.

4.2 Objective Functions

3D human body pose estimation (HBPE) in real time is the task of estimating joint locations. For this
reason, it is necessary to calculate the loss between the predicted joint locations and the
ground truth. During the training of CNNs, we must compute the gradient of the objective function
with respect to the CNN's parameters. There are many objective functions in the literature, but in our work
we only use the L2 loss function; to make the comparison easier for the reader, we contrast the L2 loss
function with the L1 loss function.

4.2.1 Least Absolute Deviations (L1 ) Loss Function

Mathematically, the L1 loss function and its gradient are formulated as follows:

L_1(x) = Σ_{i=0}^{n} |y_i − h(x_i)|

∇L_1(x) = −Σ_{i=0}^{n} ((y_i − h(x_i)) / |y_i − h(x_i)|) ∇h(x_i)

where y_i stands for the true values and h(x_i) for the predicted values.

4.2.2 Least Square Errors (L2 ) Loss Function

This function can be formulated as follows:

L_2(x) = Σ_{i=0}^{n} (y_i − h(x_i))^2

∇L_2(x) = −2 Σ_{i=0}^{n} (y_i − h(x_i)) ∇h(x_i)

such that x represents the input, h(x) the output of the neural network h, and y the target value.

Now, the question is how to decide between the L_1(x) and L_2(x) functions. In general, L_2(x) is
preferred in most cases because it has the desirable property of differentiability. As mentioned above,
in our research we only use the L_2(x) loss function; for more information, we summarize
the features of the L_1(x) and L_2(x) functions in the following table:

L1(x) Loss Function L2(x) Loss Function
Robust Not very robust
Unstable solution Stable solution
Possibly multiple solution Always one solution
Computational inefficient on nonsparse cases Efficient analytic solution
Sparse output Non-sparse output
Built in feature selection No feature selection

Table 4.2: This table summarizes the features of L1 and L2 functions [91].
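
A minimal numpy sketch of the two losses and of their gradients with respect to the predictions (the toy values are our own choosing):

import numpy as np

y = np.array([1.0, 2.0, 3.0])   # true values
h = np.array([1.5, 1.5, 2.0])   # predicted values

l1 = np.sum(np.abs(y - h))      # least absolute deviations: 2.0
l2 = np.sum((y - h) ** 2)       # least square errors: 1.5

# Gradients with respect to the predictions h:
grad_l1 = -np.sign(y - h)       # piecewise constant, not defined at y = h
grad_l2 = -2.0 * (y - h)        # smooth, proportional to the residual

print(l1, l2)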

Chapter 5

Implementation

In this chapter, we describe the empirical implementation of our approach, which produced the experimental
results of the discussed validations and optimizations, along with the encountered problems and the
lessons learned. The application was run several times with different activation
functions under diverse circumstances, which will be discussed and demonstrated in the next chapter.

5.1 Environment

The setup and implementation environment was the Linux operating system (Ubuntu 16.04).
The foremost software tool was Python 3.7, more precisely Anaconda 3.7 as the distribution of the Python
programming language; its support for large-scale data processing is one of its essential advantages and
enabled us to overcome some of the encountered problems we describe later. Python was run
in PyCharm as the integrated development environment.

5.2 Used Libraries and Packages

5.2.1 PyTorch

PyTorch is an open-source machine learning library built for Python. Its advantage in comparison
to other libraries is its efficiency with Python, and it includes many layers and loss functions, which
intensified our processes of validation and optimization [29].

5.2.2 NumPy

NumPy, as a fundamental library for scientific computing, is needed mainly for the deep learning part
of the human body pose estimation. It provides a powerful multidimensional array object [30].

5.2.3 H5py

The h5py package helps us store and manipulate the enormous amounts of numerical data of the
Human3.6M data-set. It lets us use the data flexibly for the intended purpose and behaves similarly
to a real NumPy array [31].

5.2.4 Matplotlib

Matplotlib is a library that assists us in visualizing the predictions of the trained models. Additionally,
the dependencies ffmpeg (to export MP4 videos) and Pillow (to export GIFs) are needed.

5.3 Additional supporting methods

Recently, several frameworks and structures have emerged that improve pose estimation and
handle some of the keypoint detection issues; we utilize them in our implementation.

5.3.1 Mask R-CNN

Mask R-CNN is a framework developed by Facebook AI Research [92] that extends bounding-box
recognition by adding a branch that predicts a segmentation mask for each detected instance.
It is easy to train and generalizes to other tasks.

5.3.2 Stacked hourglass

The stacked hourglass is a popular method for pose estimation that down- and up-samples the feature
maps with skip connections between the modules to enhance the performance of body pose estimation [93].

5.3.3 Cascaded Pyramids Network (CPN)

The Cascaded Pyramid Network is a network structure [93] presented to solve the issue of detecting
hard keypoints: it first localizes the simple keypoints and then integrates all levels of feature maps to
detect the hard keypoints, such as occluded or hidden ones.

5.4 Preparing the datasets

We performed our implementation based on two data-sets:

5.4.1 Human3.6M

Human3.6M is "a large-scale Data-set and predictive Methods for 3D Human Sensing in Natural Envi-
ronments".[32] We downloaded a pre-processed data-set [Provided by Martinez et al] to avoid the most
encountered problems while using MATLAB to convert it from its original format. Then it has been
prepared in the Python environment to end up with two files: one for the 3D poses, and the other for
the ground-truth 2D poses.

5.4.2 HumanEva-I

HumanEva-I is a data-set based on seven calibrated videos containing four subjects [33]. We
downloaded the original data-set with the released code and critical updates. In the MATLAB
environment, the conversion script was run to produce a directory containing the converted 2D/3D
ground-truth poses on the 15-joint skeleton. Finally, we ran the Python script to generate two files,
one for the 3D poses and one for the ground-truth 2D poses.

5.5 Implementation and visualization

We based our implementation on the published code of [45].

5.5.1 Supervised

With the Human3.6M data-set, the evaluation was run from scratch. We trained a new model for
80 epochs to produce a top-performing baseline of 46.8 mm; we used bounding boxes from a Mask
R-CNN with a ResNet-101-FPN backbone, using its reference implementation in Detectron, as well as the
fine-tuned cascaded pyramid network (CPN). The architecture was adjusted to a receptive field of 243
frames [34].
Testing with HumanEva-I differs from the H3.6M data-set because it is much smaller. Therefore, we used
Mask R-CNN detections with models pre-trained in 2D on COCO and performed the training for 1000
epochs, producing 33.0 mm on 3 actions (Walk, Jog, Box), with an architecture with a receptive field of 27
frames. It is essential to note that some frames in HumanEva are corrupted; this is overcome by
discarding them and grouping the valid sequences into adjacent chunks.
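
As a sanity check of the receptive-field numbers above, the following sketch assumes a simplified version of the temporal convolutions in [45] (kernel size 3, dilation 3^b in block b); it is our own back-of-the-envelope model, not the exact layer layout of the reference code:

def receptive_field(num_blocks, kernel=3):
    # stacked 1D convolutions with exponentially increasing dilation
    rf = 1
    for b in range(num_blocks):
        dilation = kernel ** b      # 1, 3, 9, 27, ...
        rf += (kernel - 1) * dilation
    return rf

print(receptive_field(3))  # 27  (HumanEva-I setting)
print(receptive_field(5))  # 243 (Human3.6M setting)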

5.5.2 Semi-Supervised

Semi-supervised training is only supported and implemented for the Human3.6M data-set. The semi-
supervised approach employs the unlabeled videos in combination with ground-truth 2D poses as
input. The training is supervised on just 10% of subject 1 in the H3.6M data-set, and the other subjects
are treated as unlabeled data for the semi-supervision.

5.5.3 Visualization

In this setting, we can render a specified video and generate a visualization in GIF form, which
comprises three main viewports: the 2D input keypoints, the 3D reconstruction, and the 3D ground truth.
When visualizing unlabeled videos, the 3D ground truth is not displayed. The difference
between visualization with Human3.6M and HumanEva-I is that Human3.6M works out of the box,
while videos from HumanEva must be segmented and the corrupted frames dropped [45].

5.6 Challenges and faced problems

The preparation and installation of the main deep learning tools needed to perform the validation and
optimization tests was not easy. There are unfortunately no straightforward solutions for the implementation,
in addition to the lack of resources for 3D human pose estimation.
Handling a large-scale data-set requires a lot of time, as well as a high-performance computer in terms of
CPU and GPU, which could shorten the time of our several tests.
Considerable time was also consumed in finding the most appropriate tools and technologies to serve our
research and produce the best attainable validation results.
Furthermore, the troubles confronted while installing and processing the required packages in the Python
platform were a critical obstacle. For instance, installing the PyTorch library and preparing the data-sets
within the Python 3.7 environment was resolved by running within Anaconda 3.7. There was
also a recurring error when testing the data-sets, namely that the object array could not be loaded, which was
solved by installing another version of the NumPy package.
Ultimately, a minor challenge was adjusting and adapting the script for the various tests with our
different suggested validations.

Chapter 6

Evaluation

In this chapter, we present the results of our experimental implementation and compare them with the
most common outcomes. Recently, research from [45] was published with remarkable results that
frequently outperform the best previously performing methods, which made our job more challenging. For
that reason, we compare our outcomes with those results [45].

Initially, we provide the tables of the reconstruction errors for our selected data-sets in
both state-of-the-art supervised and semi-supervised settings. We ran our tests several times in different
environments, alternating our optimization tools, and picked only the best results.

6.1 Reconstruction errors

The reconstruction errors are estimated under three different evaluation protocols, as follows:

1. Protocol 1 reports the mean per-joint position error (MPJPE) in millimeters, which is the mean
Euclidean distance between the predicted joint positions and the ground-truth joint positions.

2. Protocol 2 reports the error after alignment with the ground truth in translation, rotation, and
scale (P-MPJPE).

3. Protocol 3 aligns the predicted poses with the ground truth only in scale (N-MPJPE), for the
semi-supervised experiment.

In addition to the three main position errors, the mean per-joint velocity error (MPJVE) is also crucial
for videos, to estimate the smoothness of the generated pose sequence [97].
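
Protocol 1 admits a very compact implementation; a sketch (with our own function name), assuming joint positions in millimeters:

import numpy as np

def mpjpe(predicted, target):
    # predicted, target: arrays of shape (num_joints, 3), in millimeters;
    # mean Euclidean distance over the joints
    return np.mean(np.linalg.norm(predicted - target, axis=-1))

pred = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
gt   = np.array([[0.0, 3.0, 4.0], [10.0, 0.0, 0.0]])
print(mpjpe(pred, gt))  # (5.0 + 0.0) / 2 = 2.5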

6.2 Supervised Training

In this training mode we alternate between our three proposed activation functions with the L2 loss
function, following the two-step technique with ResNet-101 for the Mask R-CNN [96] and utilizing the
temporal information to tackle the detection issues with the CNN [95, 45]. We finally compare those
results with the ones given in [45], taking into account that they ran all of their training with the ReLU
activation function.

The results shown next are only those with the lowest reconstruction errors, together with
a description of the conditions under which the training was run.

6.2.1 Reconstruction error on Human3.6M data-set

# Dir Disc Eat Greet Phone Photo Pose Purch Sit SitD Smoke Wait WalkD Walk WalkT Avg
[45], Single frame 47.1 50.6 49 51.8 53.6 61.4 49.4 47.4 59.3 67.4 52.4 49.5 55.3 39.5 42.7 51.8
[45], 243 frame, causal conv 45.9 48.5 44.3 47.8 51.9 57.8 46.2 45.6 59.9 68.5 50.6 46.4 51 34.5 35.4 49
[45] $, 243 frame, full conv 45.2 46.7 43.3 45.6 48.1 55.1 44.6 44.3 57.3 65.8 47.1 44 49 32.8 33.9 46.8
[45] $ § , 243 frame , full conv 45.1 47.4 42 46 49.1 56.7 44.5 44.4 57.2 66.1 47.5 44.8 49.2 32.6 34.6 47.1
Ours * $ §, with Leaky ReLU 51.9 57.8 53.7 55.5 56.5 64.3 52.9 60.2 69.4 84.8 55.5 52.5 65.4 38.8 40.4 57.3
Ours with Sigmoid 1917 1958 1938 1925 1948 1936 1923 2001 1980 2030 1953 1942 1985 1936 1925 1953
Ours + $,optimized with Leaky Relu 45.6 46.7 42.8 45.7 48.6 56 44.2 44.1 57 64.5 47.5 44.5 49.6 32.7 33.9 46.9

Table 6.1: Protocol 1: Compare the reconstruction error (MPJPE) of [45] with our optimizations.

# Dir Disc Eat Greet Phone Photo Pose Purch Sit SitD Smoke Wait WalkD Walk WalkT Avg
[45] , Single frame 36 38.7 38 41.7 40.1 45.9 37.1 35.4 46.8 53.4 41.4 36.9 43.1 30.3 34.8 40
[45], 243 frame, causal conv 35.1 37.7 36.1 38.8 38.5 44.7 35.4 34.7 46.7 53.9 39.6 35.4 39.4 27.3 28.6 38.1
[45] $ , 243 frame, full conv 34.1 36.1 34.4 37.2 36.4 42.2 34.4 33.6 45 52.5 37.4 33.8 37.8 25.6 27.3 36.5
[45] $ §, 243 frame, full conv 32.2 36.8 33.9 37.5 37.1 43.2 34.4 33.5 45.3 52.7 37.7 34.1 38 25.8 27.7 36.8
Ours * $ §, with Leaky ReLU 38.6 42 38.9 42.9 43.4 47.5 39.7 41 52.7 63 43.6 39.1 43.7 30.4 33.2 42.7
Ours with Sigmoid * 177 200 198 189 191 188 190 215 221 244 199 186 191 190 200 200
Ours + $,optimized with Leaky Relu 34.3 36.2 34.2 37.2 36.8 42.8 33.9 33.2 44.8 52 37.6 33.7 37.8 25.6 27.2 36.5

Table 6.2: Protocol 2: Compare the reconstruction Error after rigid alignment with the ground truth
(P-MPJPE) of [45] with our optimizations.

# Dir Disc Eat Greet Phone Photo Pose Purch Sit SitD Smoke Wait WalkD Walk WalkT Avg
[45], Single frame 12.8 12.6 10.3 14.2 10.2 11.3 11.8 11.3 8.2 10.2 10.3 11.3 13.1 13.4 12.9 11.6
[45], Temporal 3 3.1 2.2 3.4 2.3 2.7 2.7 3.1 2.1 2.9 2.3 2.4 3.7 3.1 2.8 2.8
Ours * $ § temporal, with Leaky ReLU 3.2 3.3 2.5 3.7 2.5 2.9 2.8 3.2 2.2 3 2.4 2.6 3.9 3.4 3.1 3
Ours * temporal, with Sigmoid 3.9 4.2 3.5 4.8 3.3 3.6 3.5 4.8 2.8 3.7 3.3 3.6 5.5 5.9 5.3 4.1
Ours +$ temporal, optimized with Leaky Relu 2.9 3 2.5 3.4 2.3 2.7 2.6 3.1 2.1 2.8 2.3 2.4 3.7 3.1 2.8 2.8

Table 6.3: Velocity error over the 3D poses generated by a convolutional model that considers time and
a single frame baseline.
(*) Test using a pre-trained model [89], with fine-tuned CPN and a receptive field of 243 frames, for up to 60
epochs. (+) Test with training the model from scratch with fine-tuned CPN and a receptive field of
243 frames, for up to 80 epochs and a batch size of 1024. ($) Uses temporal information. (§)
Uses ground-truth bounding boxes.

When examining tables 6.1 and 6.2, which show the various errors under protocols 1 and 2, we
are first of all surprised by the enormous errors resulting from training with the Sigmoid activation
function, which confirms in practice that this function is not the proper choice when working with
CNN models.

The first result of testing a pre-trained H3.6M model [89] with Leaky ReLU was very disappointing:
that test showed an apparent deterioration in comparison to the results of [45] under the same
conditions. However, when we trained the model from scratch with Leaky ReLU, we achieved
a marked reduction of the reconstruction error, competitive with the ReLU training of
[45], as shown in figures 6.1 and 6.2. On the other hand, Sigmoid achieves a velocity error close to ReLU
and Leaky ReLU, with a difference of about 1.3 mm.

Figure 6.1: Reconstruction error in Protocol 1 with Human36M

6.2.2 Reconstruction error on HumanEva-I data-set

Table 6.4 shows results similar to those on Human3.6M concerning the error ratios with
Sigmoid and Leaky ReLU, in both cases compared to [45], which is also clearly visible in figure 6.3.
Figure 6.2: Reconstruction error in Protocol 2 with Human36M

Figure 6.3: Velocity error with Human36M

# Walk Walk Walk Jog Jog Jog Box Box Box
# S1 S2 S3 S1 S2 S3 S1 S2 S3
[45], SA 14.5 10.5 47.3 21.9 13.4 13.9 24.3 34.9 32.1
[45], MA 13.9 10.2 46.6 20.6 13.1 13.8 23.8 33.7 32
Ours +, MA with LeakyReLU 22.5 21.2 63.4 26.9 23.9 28 28.6 37 37.7
Ours +, MA Sigmoid 220.1 234.1 230.5 194 209.7 210.2 168.8 222.8 236.4
Ours *,MA with LeakyReLU 13.8 10 46.9 20.7 13 13.7 24.3 33.4 32

Table 6.4: Error of HumanEva-I under protocol 2 for a single action (SA) and multi-action (MA)
models.
SA: training a different model for every action. MA: training one model for all the actions. (+): test
with an evaluated pre-trained model and Mask R-CNN detections with a receptive field of 27 frames
[45]. (*): test by training a new model for 1000 epochs using Mask R-CNN detections.

Figure 6.4: Reconstruction error in Protocol 2 with HumanEva-I

6.3 Semi-Supervised Training

We ran this evaluation for two hundred epochs with a receptive field of 27 frames
and only 0.1% of S1, with a batch size of 64. We started with five warm-up epochs of training
over the labeled data before adding the semi-supervised part. Moreover, we alternated between the CPN 2D
keypoints and the ground-truth 2D poses as input to attain the best possible outcomes.

# Dir Disc Eat Greet Phone Photo Pose Purch Sit SitD Smoke Wait WalkD Walk WalkT Avg
Ours, (k) 43.2 51.9 56.8 52.3 62.7 88.5 54.7 56 68.3 130 55.6 60 66 49.6 51.8 63.2
Ours, (j) 43.1 51.5 56.3 51.8 60.8 86.9 54.8 56.5 65.7 122.3 54.5 60.2 64.5 49.3 51.6 62
Ours, (l) 47.1 62.6 60.4 54.2 73.4 89.9 59.3 69.1 76.6 122.1 60.3 67.3 66.4 48.3 59.4 67.8

Table 6.5: Evaluation of the semi-supervised training with the Human3.6M data-set under protocol 2.

# Dir Disc Eat Greet Phone Photo Pose Purch Sit SitD Smoke Wait WalkD Walk WalkT Avg
Ours, (k) 34.4 41.1 40.2 41 47 58.2 43.1 41 53.1 95.7 44 45.8 48.8 38.8 40.3 47.5
Ours, (j) 34.4 41.2 40 40.6 46.8 58.1 43.5 41.1 52.2 92.2 43.5 46.5 48.4 38.9 40.8 47.2
Ours, (l) 37.8 45.9 43.7 42.9 51 61.7 46.2 47.7 55.7 93.9 45.7 50.5 48.1 41.5 47.4 50.7

Table 6.6: Evaluation of the semi-supervised training with the Human3.6M data-set under protocol 3.
(k) evaluation using ReLU, (j) evaluation using Leaky ReLU, (l) evaluation using Sigmoid.

Tables 6.5 and 6.6 show that Leaky ReLU outperforms ReLU by 1.2 mm in the
action-wise average under protocol 2 and by 0.3 mm under protocol 3, a wider margin
than in the supervised setting.

Interestingly, Sigmoid reduced its reconstruction errors vastly in the semi-supervised
training, coming close to the other activation functions and thus joining the competition among
activation functions in CNNs, where it may be exploited in a better way.

We will discuss the diverse performance of the three activation functions in both supervised
and semi-supervised training in the next chapter.

Figure 6.5: Reconstruction error in protocol 2 when training the H36M data-set in the semi-supervised setting

Figure 6.6: Reconstruction error in protocol 3 when training the H36M data-set in the semi-supervised setting

Chapter 7

Conclusion

In this chapter, we discuss our results and see to what extent they fulfilled the objective of this
research. Additionally, we present some concluding observations and a future vision for this area.

7.1 Our Vision

Our work is built on the question of how to optimize and improve the accuracy of 3D human body pose
estimation. That led us to begin by analyzing and comprehending the primary technology for building 3D
HBP models, namely convolutional neural networks (CNNs). We realized in the meantime that the
activation functions are the cornerstone of neural network models, because the activation
function is the connection between the input and output signals. For our hypothesis we selected three
functions to validate and test:
ReLU is the most used one in CNN models and has pretty remarkable results. Sigmoid is a successful
and interactive function in RNN models. Leaky ReLU is an effort to increase the range of ReLU and
solve its dying-neuron problem, but it remains not as widely used in CNN structures as ReLU. Besides,
we studied the different layer hierarchies, such as ResNet (50, 101, 152, etc.), and the loss functions used
in neural networks and their interplay with the activation functions, and found that L2 is the
proper one for fine-tuning CNN performance.

Finally, after some comparisons between the well-known data-sets, we chose Human3.6M
and HumanEva-I as our main benchmarks to validate our proposal empirically.

7.2 Workout

It was not so easy to prepare a valid environment for our implementation, especially with the
various requirements, from packages and libraries to dependencies, needed by the 3D-HBPE
frameworks; hence considerable time was consumed troubleshooting errors and adapting the components
to each other to overcome the issues confronted during training and testing of the data-sets.

What also made our mission more complicated is that we had to inject one of our proposed
activation functions into the used CNN model and run it many times in different training forms, where
some tests took more than 24 hours to complete.

As mentioned before, we divided our examinations into two types of training to observe the
performance of each activation function separately. In supervised training on both data-sets,
ReLU and Leaky ReLU presented almost the same performance, with small error variations in some actions,
while Sigmoid produced an enormous range of errors, exceeding a thousand millimeters
under protocol 1. Semi-supervised training inverted the supervised picture: Leaky ReLU
outperformed ReLU, and Sigmoid attained an incredible improvement, with its reconstruction errors
becoming quite comparable to those of ReLU and Leaky ReLU.

7.3 Results and discussions

To figure out the main reasons for the varying outcomes of each activation function in both state-
of-the-art estimation settings, we need to understand the nature of each activation function and correlate it
with the parameters of the different factors in every training mode.

In the case of Leaky ReLU, we ran the evaluation twice in supervised training: the
first time by evaluating a pre-trained model [45] with 60 epochs, and the second time by evaluating a
model trained from scratch with 80 epochs. In semi-supervised training, both ReLU and Leaky ReLU
were evaluated with 200 epochs.

Conceptually, the number of epochs is the number of passes of the entire
data-set through the network, from input to output and back again, needed to help the learning model reach the best
attainable error minimization. Accordingly, increasing the number of epochs helps the activation func-
tion update the internal model parameters on every sample in the data-set. The extra rectification in
the Leaky ReLU function helps it outperform ReLU in some highly dynamic actions, like boxing in
HumanEva-I, or with occluded keypoints, as in eating and sitting in the H36M data-set.

The significant fluctuation of the reconstruction errors in the case of Sigmoid between supervised and
semi-supervised training stems from the fact that Sigmoid provides good predictivity only when
the output probability is in the range (0, 1) [94]. Our semi-supervised training approach, following [45],
is based on considering various subsets of the Human3.6M data-set as labeled data while the remaining
samples are used as unlabeled data. The general setup starts by downsampling the data, which
means discarding all the negative samples, and uses only a single-frame model for 0.1% of S1 in
Human3.6M. That enables Sigmoid to perform decidedly better in semi-supervised training than
in supervised training, because it handles various subsets of the data-set as labeled data without downsampling.

7.4 Future Work

The field of deep learning is developing rapidly, and every day there is tangible progress on CNNs in
terms of enhancing the capabilities of object detection and human body pose estimation in both 2D
and 3D. Our thesis focused on the validation of activation functions in CNN models in order
to improve the performance of 3D pose estimation under two training methods. However, we still need
to investigate the characteristics of the three activation functions in unsupervised training and with
the visualization of custom videos.

Leaky ReLU ought to become more efficient on the hard-to-detect keypoints by investing more in the
performance of that function with the different CNN algorithms and adjusting it with improvement
methods like CPN.

Lastly, working with further activation functions, such as Tanh, in order to test their efficiency with
the different CNN models could enrich our attempts to improve 3D pose estimation.
Appendix
Attached to the hard copy of this Master thesis is a digital copy on a USB stick containing:

1. The source script, which we used to run our implementations.

2. The results of the different evaluations, separated into files by the used activation function, the training
method, and the used data-set.

3. The generated GIF files of visualizing the results using HumanEva-I.

4. A soft copy of our thesis.


Bibliography
[1]. SAM JOHNSON and MARK EVERINGHAM, "Learning effective human pose estimation
from inaccurate annotation," IEEE conference on computer vision and pattern recognition, 2011. URL:
http://www.comp.leeds.ac.uk/me/publications/cvpr11 pose.pdf.

[2]. CIRESAN DAN, MEIER U, and SCHMIDHUBER J. (June 2012). "Multi-column deep
neural networks for image classification." 2012 IEEE conference on computer vision and pattern
recognition: 3642-3649. arXiv:1202.2745. DOI:10.1109/cvpr.2012.6248110. ISBN 978-1-4673-12228-8.

[3]. KRIZHEVSKY ALEX, SUTSKEVER ILYA, and HINTON GEOFFRY (2012). "ImageNet
classification with deep convolutional neural networks," Neural information processing system, Lake
Tahoe, Nevada.

[4]. DENG, L and Yu, D. (2014). "Deep Learning: Methods and applications," foundations
and trends in signal processing.

[5]. THOMAS ANDY, “An Introduction To Neural Networks For Beginners,” Adventures in
Machine Learning.

[6]. GOODFELLOW, IAN; BENGIO, JOSHUA; COURVILLE, ARON, "Deep Learning"

[7]. Wikimedia Inc. 2019: https://en.wikipedia.org/wiki/Deep learning and convolutional


neural networks., San Francisco, 2019.

[8]. JAIN, V; ROTH, F; MURRAY, J.F; TURAGA, S; ZHINGULIN, V; HELMSTAEDTER,
M.N; BRIGGMAN, K.L; DENK, W and SEUNG H.S, 2007, "Supervised Learning Of Image Restoration
With Convolutional Neural Networks," Computer Vision, 2007. ICCV 2007. IEEE Eleventh Interna-
tional Conference on IEEE.

[9]. Pinheiro, P. H. O., and Collobert R, "Recurrent CNNs for scene labeling," ICML 2014.

[10]. Pinheiro, P. H. O., and Collobert R, "From image-level to pixel-level labeling with
CNN’s," In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[11]. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers:
Surpassing Human-Level Performance on ImageNet classification. CoRR, abs/1502.01852, 2015.

[12]. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. CoRR, abs/1512.03385, 2015.

[13] Buehler, Patrick, et al. "Upper body detection and tracking in extended signing se-
quences."International journal of computer vision 95.2 (2011): 180-197.

[14] Sapp, Benjamin, David Weiss, and Ben Taskar. "Parsing human motion with stretchable
models."Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.

[15]. Girshick, Ross, et al. "Efficient regression of general-activity human poses from depth
images." 2011 International Conference on Computer Vision. IEEE, 2011.

[16]. http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/

[17]. Jean-Claude Bermond, Michel Cosnard, Stéphane Pérennes. Directed acyclic graphs with
the unique dipath property. Theoretical Computer Science, Elsevier, 2013, 504, pp. 5-11.
DOI: 10.1016/j.tcs.2012.06.015. hal-00869501.

[18]. Vera Kurková and Paul C. Kainen, "Functionally Equivalent Feedforward Neural Net-
works.", January 1994 Neural Computation 6:543-558

[19]. Alexander Toshev and Christian Szegedy. DeepPose: Human pose estimation via deep
neural networks. CoRR, abs/1312.4659, 2013.

[20]. Jane Bromley, James W Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore,
Eduard Säckinger, and Roopak Shah. Signature verification using a “siamese.” Time-Delay neural
network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993.

[21]. Tomas Pfister, Karen Simonyan, James Charles, and Andrew Zisserman. Deep CNNs for
efficient pose estimation in gesture videos. In Asian Conference on Computer Vision (ACCV), 2014. 1,
3-2, 3-5

[22]. Arjun Jain, Jonathan Tompson, Mykhaylo Andriluka, Graham W. Taylor, and Christoph
Bregler. Learning human pose estimation features with convolutional networks. CoRR, abs/1312.7302,
2013.

[23]. Fei Han, Brian Reily, William Hoff, and Hao Zhang. Space-Time representation of people
based on 3D skeletal data: A review. CoRR, abs/1601.01006, 2016.

[24]. Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M:
Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2014.

[25]. Leonid Sigal, Alexandru O. Balan, and Michael J. Black. Human Eva: Synchronized
video and motion capture dataset and baseline algorithm for evaluation of articulated human motion.
Int. J. Comput. Vision, 87(1-2):4–27, March 2010.

[26]. Rodney J. Douglas and Kevan A.C. Martin. Recurrent neuronal circuits in the neocortex.
Current Biology, 17(13): R496 – R500, 2007.

[27]. Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks.
In Geoffrey J. Gordon and David B. Dunson, editors, Proceedings of the Fourteenth International
Conference on Artificial Intelligence and Statistics (AISTATS- 11), volume 15, pages 315–323. Journal
of Machine Learning Research - Workshop and Conference Proceedings, 2011.

[28]. Kaiming He, Georgia Gkioxari, Piotr Dollar, Ross Girshick, "Mask R-CNN,"
arXiv:1703.06870v3 [cs.CV], 24 Jan 2018.

[29]. Tutorials Point (I) Pvt. Ltd. - Copyright 2019 by Tutorials Point (I) Pvt. Ltd. -
https://www.tutorialspoint.com/pytorch/pytorch-utorial.pdf

[30]. NumPy User Guide Release 1.11.0 - Written by the NumPy community -
https://docs.scipy.org/doc/numpy-1.11.0/numpy-user-1.11.0.pdf

[31]. HDF5 for Python - Andrew Collette and contributors, Revision 8d96a14c. -
http://docs.h5py.org/en/stable/

[32]. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, No. 7, July
2014

[33]. Max Planck Institute for Intelligent Systems (from now on "MPI") and the Max Planck
Institute for Biological Cybernetics (from now on “KYB”), http://humaneva.is.tue.mpg.de/data-license

[34]. 3D human pose estimation in video with temporal convolutions and semi-supervised
training, Conference on Computer Vision and Pattern Recognition (CVPR) 2019

[35] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler.
Efficient object localization using convolutional networks. CoRR, abs/1411.4280, 2014.

[36]. Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a
mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202,
1980.

[37]. Dushyant, Mehta1; Srinath, Sridhar; Oleksandr, Sotnychenko1; Helge, Rhodin; Mo-
hammad, Shafiei; Hans-Peter, Seidel1; Weipeng, Xu; Dan, Casas and Christian, Theobalt, "VNECT:
Real-time 3D Human Pose Estimation with a Single RGB Camera", Max Planck Institute for Infor-
matics (GVV Group), Saarland University, Universidad Rey Juan Carlos and ACM Transactions on
Graphics (SIGGRAPH 2017), Los Angeles, USA.

[38] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with
deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger,
editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates,
Inc., 2012.

[39] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-
Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision
(IJCV), 115(3):211–252, 2015.

[40]. Michael A Nielsen. Neural networks and deep learning, 2015.

[41]. Leon Bottou, "Large-Scale Machine Learning with Stochastic Gradient Descent", NEC
Labs America, Princeton NJ 08542, USA.

[42]. CHIGOZIE ENYINNA NWANKPA, IJOMAH WINIFRED, GACHAGAN ANTHONY,
and MARSHALL STEPHEN, "Activation Functions: Comparison of Trends in Practice and Research
for Deep Learning", arXiv:1811.03378v1 [cs.LG], 8 Nov 2018.

[43]. L. Deng, “A tutorial survey of architectures, algorithms, and applications for deep learn-
ing,” APSIPA Transactions on Signal and Information Processing, vol. 3, 2014. [Online]. Available:
https://doi.org/10.1017/atsip.2013.9

[44]. M. Z. Alam, T. M. Taha, C. Yakopcic, S. Westberg, M. Hasan, B. V. Essen, and V. K.
Asari, "The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches,"
arXiv, 2018. [Online]. Available: http://arxiv.org/abs/1803.01164

[45]. DARIO PAVLLO, CHRISTOPH FEICHTENHOFER, DAVID GRANGIER, and MICHAEL
AULI, "3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised
Training", Facebook AI Research, in CVPR, 29 March 2019.

[46]. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D.
Jackel, "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, vol. 1,
no. 4, pp. 541–551, 1989. [Online]. Available: https://doi.org/10.1162/neco.1989.1.4.541

[47]. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convo-
lutional neural networks,” 2012. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary

[48]. K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale
Image Recognition,” arXiv, 2015. [Online]. Available: https://arxiv.org/pdf/1409.1556.pdf

[49]. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich, "Going deeper with convolutions," in Conference on Computer Vision and Pattern
Recognition (CVPR), IEEE, 2015, pp. 1–9.

[50] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,”
in IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2016, pp. 770–778. [Online]. Available:
https://doi.org/10.1109/CVPR.2016.90

[51] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic
depth,” ArXiv, vol. 9908, no. LNCS, pp. 646–661, 2016. [Online]. Available: 10.1007/978-3-319-46493-
0-39;http://arxiv.org/abs/1603.09382

[52] J. Turian, J. Bergstra, and Y. Bengio, “Quadratic features and deep architectures for
chunking,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North
American Chapter of the Association for Computational Linguistics„ vol. Companion Volume:, 2009,
pp. 245–248. [Online]. Available: https://dl.acm.org/citation.cfm

[53]. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
"Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning
Research, vol. 15, pp. 1929–1958, 2014.

[54]. D. J. Matas, "All you need is a good init," ArXiv, 2016. [Online]. Available:
http://arxiv.org/abs/1511.06422

[55] P. Luo, X. Wang, W. Shao, and Z. Peng, “Understanding Regularization in Batch Nor-
malization,” arXiv, 2018. [Online]. Available: http://export.arxiv.org/abs/1809.00846

[56] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift„” in Proceedings of the 32nd International Conference on Machine
Learning, PMLR. JMLR, 2015, pp. 448–456.

[57] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neu-
ral networks,” in International Conference on Artificial Intelligence and Statistics, vol. 9. PLMR,
2010, pp. 249–256. [Online]. Available: http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf?hc-
location=ufi

[58]. P. Ramachandran, B. Zoph, and Q. V. Le, "Searching for Activation Functions," ArXiv,
2017. [Online]. Available: http://arxiv.org/abs/1710.05941

[59]. X. Glorot, A. Bordes, and Y. Bengio, "Deep Sparse Rectifier Neural Networks," in
International Conference on Machine Learning, 2011. [Online]. Available:
http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf

[60]. K. He, X. Zhang, S. Ren, and J. Sun, "Delving Deep into Rectifiers: Surpassing
Human-Level Performance on ImageNet Classification," arXiv, 2015. [Online]. Available:
http://arxiv.org/abs/1502.01852

[61] S. M. Noor, J. Ren, S. Marshall, and K. Michael, “Hyperspectral Image Enhancement and Mixture Deep-Learning Classification of Corneal Epithelium Injuries,” Sensors, vol. 17, no. 11, p. 2644, 2017.

[62] T. N. Sainath, B. Kingsbury, G. Saon, H. Soltau, A. Mohamed, G. Dahl, and B. Ramabhadran, “Deep Convolutional Neural Networks for Large-scale Speech Tasks,” Neural Networks, vol. 64, pp. 39–48, 2015. [Online]. Available: https://doi.org/10.1016/J.NEUNET.2014.08.005

[63] A. Graves, A. Mohamed, and G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.

[64] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation,” arXiv, 2015. [Online]. Available: http://arxiv.org/abs/1511.00561

[65] R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick, “Learning to Segment Every Thing,” arXiv, 2017. [Online]. Available: http://arxiv.org/abs/1711.10370

[66] B. Karlik and A. Vehbi, “Performance Analysis of Various Activation Functions in Generalized MLP Architectures of Neural Networks,” International Journal of Artificial Intelligence and Expert Systems (IJAE), vol. 1, no. 4, pp. 111–122, 2011. [Online]. Available: http://www.cscjournals.org/library/manuscriptinfo.php

[67] D. L. Elliott, “A Better Activation Function for Artificial Neural Networks,” 1993. [Online]. Available: https://drum.lib.umd.edu/handle/1903/5355

[68] R. M. Neal, “Connectionist Learning of Belief Networks,” Artificial Intelligence, vol. 56, no. 1, pp. 71–113, 1992. [Online]. Available: https://doi.org/10.1016/0004-3702(92)90065-6

[69] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 3123–3131. [Online]. Available: https://arxiv.org/pdf/1511.00363.pdf

[70] S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning,” arXiv, 2017. [Online]. Available: http://arxiv.org/abs/1702.03118

[71] A. Maas, A. Hannun, and A. Ng, “Rectifier Nonlinearities Improve Neural Network Acoustic Models,” in International Conference on Machine Learning (ICML), 2013.

[72] M. Lin, Q. Chen, and S. Yan, “Network In Network,” arXiv, 2013. [Online]. Available: http://arxiv.org/abs/1312.4400

[73] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. V. Le, and G. E. Hinton, “On Rectified Linear Units for Speech Processing,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 3517–3521. [Online]. Available: https://doi.org/10.1109/ICASSP.2013.6638312

[74] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical Evaluation of Rectified Activations in Convolutional Network,” arXiv, 2015. [Online]. Available: https://arxiv.org/abs/1505.00853

[75] X. Jin, C. Xu, J. Feng, Y. Wei, J. Xiong, and S. Yan, “Deep Learning with S-shaped Rectified Linear Activation Units,” arXiv, 2015, pp. 1737–1743. [Online]. Available: https://arxiv.org/pdf/1512.07030.pdf

[76] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia, “Incorporating Second-Order Functional Knowledge for Better Option Pricing,” in Advances in Neural Information Processing Systems (NIPS), vol. 13, 2001, pp. 472–478. [Online]. Available: https://papers.nips.cc/paper/1920-incorporating-second-order-functional-knowledge-for-better-option-pricing

[77] D. A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),” arXiv, 2015. [Online]. Available: http://arxiv.org/abs/1511.07289

[78] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout Networks,” in International Conference on Machine Learning (ICML), 2013, p. 1319. [Online]. Available: https://dl.acm.org/citation.cfm

[79] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. [Online]. Available: https://doi.org/10.1038/nature14539

[80] GeeksforGeeks, “Seq2Seq Model in Machine Learning,” 2019. [Online]. Available: https://www.geeksforgeeks.org/seq2seq-model-in-machine-learning/

[81] Embedded Vision Alliance, 2019. [Online]. Available: https://www.embedded-vision.com/platinum-members/cadence/embedded-vision-training/documents/pages/neuralnetworksimagerecognition

[82] corochannNote, 2019. [Online]. Available: https://corochann.com/mnist-training-with-multi-layer-perceptron-1149.html

[83] Overleaf, 2019. [Online]. Available: https://www.overleaf.com/latex/examples/neural-network-color/jwsbrhgwmgmt

[84] GitHub, “PlotNeuralNet,” 2019. [Online]. Available: https://github.com/HarisIqbal88/PlotNeuralNet

[85] NN-SVG: Publication-ready NN-architecture schematics, 2019. [Online]. Available: http://alexlenail.me/NN-SVG/LeNet.html

[86] Towards Data Science (Medium), “Understanding Encoder-Decoder Sequence to Sequence Model,” 2019. [Online]. Available: https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346

[87] GitHub, “LaTeX-examples: sigmoid-function.tex,” 2019. [Online]. Available: https://github.com/MartinThoma/LaTeX-examples/blob/master/tikz/sigmoid-function/sigmoid-function.tex

[88] Stack Exchange Network. [Online]. Available: https://tex.stackexchange.com/questions/128517/packages-to-plot-functions-such-as-the-logistic-function-or-s-shaped-function

[89] J. Martinez, R. Hossain, J. Romero, and J. J. Little, “A Simple Yet Effective Baseline for 3D Human Pose Estimation,” in IEEE International Conference on Computer Vision (ICCV), 2017.

[90] Towards Data Science, “Activation Functions in Neural Networks.” [Online]. Available: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6

[91] Chioka, “Differences between L1 and L2 as Loss Function and Regularization.” [Online]. Available: http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/

[92] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” Facebook AI Research (FAIR), arXiv:1703.06870v3 [cs.CV], 24 Jan. 2018.

[93] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, “Cascaded Pyramid Network for Multi-Person Pose Estimation,” arXiv:1711.07319v2 [cs.CV], 8 Apr. 2018.

[94] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A Search Space Odyssey,” arXiv:1503.04069v2 [cs.NE], 4 Oct. 2017.

[95] J. Li, C. Wang, H. Zhu, Y. Mao, H.-S. Fang, and C. Lu, “CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark,” Shanghai Jiao Tong University and Tsinghua University, 2019.

[96] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Microsoft Research, 2015.

[97] Y. Yuan and K. Kitani, “3D Ego-Pose Estimation via Imitation Learning,” Carnegie Mellon University, Pittsburgh, PA, USA, 2018.
