
Hand gesture recognition using 3D Convolutional Neural Network under different viewpoints

Student: Manh-Truong Dang
Student ID: 20134209 – K58 PFIEV SIC
Supervisor: Dr. Thanh-Hai Tran

Hanoi – 06/2018
Introduction
• Hand gesture recognition
[Diagram] Video stream (already temporally segmented) → Dynamic hand gesture recognition → Gesture label: On/Off, Back, Next, Increase, Decrease, ...

• Challenges:
– Hand: low resolution, deformable, high degrees of freedom
– Camera: viewpoint change
– Subject: different styles of performing the same gesture (including phase variation)
– Environment: complex background, illumination variation
Motivation
• Example of viewpoint change and complex background

[Figure: the same gesture captured from View 1 and View 2]

Hand gesture recognition should be robust to viewpoint changes and complex backgrounds in order to be applicable in practical situations. However, no existing work has evaluated in detail the robustness of recognition methods to such variations.
Objectives
• Study one advanced method for hand gesture recognition
– 3D convolutional neural network [1] (suitable and efficient for video-based tasks)
– and its extension to two streams [2]
• Evaluate the robustness of this method to viewpoint change and complex background
– working on a multi-modal multi-view dataset of dynamic hand gestures, in an application of controlling home appliances with hand gestures
[1] Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. 2015 IEEE International Conference on Computer Vision (ICCV), 4489–4497.

[2] Van-Minh Khong, Thanh-Hai Tran (2018). Improving human action recognition with two-stream 3D convolutional neural network. The First International Conference on Multimedia Analysis and Pattern Recognition, Ho Chi Minh City, April 6–7, 2018.
Page  5
Proposed study (1)
3D convolutional neural network (C3D) [1]
• Main ideas
– A 2D CNN (convolutional neural network) is suited to still-image tasks, as it captures the spatial correlations within an image (e.g., ImageNet classification)
– 3D convolution kernels can exploit temporal patterns in addition to spatial information, eliminating the need for separate temporal modeling techniques => the C3D network uses 3D convolutions

Page  6
Proposed study (1)
3D convolutional neural network (C3D) [1]
• C3D architecture
– 8 convolutional layers, 5 pooling layers, 2 fully connected (FC) layers
– Kernel size: 3x3x3; all pooling sizes 2x2x2, except the first pooling, which is 1x2x2
– Number of feature maps in the convolutional layers: 64, 128, 256, 256, 512, 512, 512, 512
– Input: a 16-frame clip, each frame of size 128 x 171
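
A minimal PyTorch sketch of this architecture, for illustration only (not the authors' released model): layer sizes follow the list above and [1], the 112 x 112 input crop is the training crop from [1], and num_classes=5 assumes the gesture set used later in this work.

import torch
import torch.nn as nn

class C3D(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        def conv(c_in, c_out):
            # 3x3x3 convolution with padding, followed by ReLU
            return [nn.Conv3d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU()]
        self.features = nn.Sequential(
            *conv(3, 64),    nn.MaxPool3d((1, 2, 2)),    # pool1: spatial only
            *conv(64, 128),  nn.MaxPool3d((2, 2, 2)),    # pool2
            *conv(128, 256), *conv(256, 256), nn.MaxPool3d((2, 2, 2)),  # pool3
            *conv(256, 512), *conv(512, 512), nn.MaxPool3d((2, 2, 2)),  # pool4
            *conv(512, 512), *conv(512, 512),
            nn.MaxPool3d((2, 2, 2), padding=(0, 1, 1)),  # pool5, padded as in [1]
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 1 * 4 * 4, 4096), nn.ReLU(), nn.Dropout(0.5),  # fc6
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),             # fc7
            nn.Linear(4096, num_classes),                                  # output layer
        )

    def forward(self, x):  # x: (batch, 3, 16, 112, 112)
        return self.classifier(self.features(x))

# Quick shape check on a random 16-frame clip
out = C3D()(torch.randn(1, 3, 16, 112, 112))  # -> (1, 5)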

Page  7
Proposed study (2)
Two-stream C3D [2]
• Main ideas:
– C3D uses only RGB data
– Optical flow (or another stream) can provide additional information for recognition => two-stream C3D (RGB + optical flow)
• Two-stream C3D architecture:
– Early fusion: FC-6 features are L2-normalized, concatenated, and then classified by an SVM
– Late fusion: the predictions of both streams (prob layer) are averaged to form the final prediction

Both [1] and [2] address only human action recognition (UCF101, HMDB51); there is no evaluation of C3D on hand gestures under different viewpoints.
Proposed study (3)
Deploying [1] and [2] for hand gesture recognition
• Main ideas:
– Pre-processing of the hand gesture data
– Investigation of different methods for the phase variation problem
– Fine-tuning the parameters of the two-stream C3D on a hand gesture dataset with different viewpoints
– Investigation of the effect of complex background on recognition accuracy
Proposed study (3.1)
Pre-processing hand gesture data
• Manual spotting of gestures from the stream
• Computing the optical flow stream based on [3]
– Optical flow characterizes the movement of pixels between consecutive images. The horizontal and vertical flow fields are interleaved into the input volume: I(u, v, 2k−1) = d_x^{k−1}(u, v), I(u, v, 2k) = d_y^{k−1}(u, v)

[Figure: two consecutive frames and the two optical flow fields in the vertical and horizontal directions]

• Stack the optical flow as a 3D volume of (dx, dy, 0) frames (a sketch follows below)

[3] http://www.robots.ox.ac.uk/~vgg/software/two_stream_action/
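
A minimal sketch of building such an input volume, assuming `flows` is a list of (dx, dy) flow-field pairs, each an H x W float array, e.g. computed with the tool from [3].

import numpy as np

def flow_clip(flows):
    """Each flow pair becomes a 3-channel frame (dx, dy, 0), so the
    OF stream can reuse the 3-channel C3D input layout."""
    frames = []
    for dx, dy in flows:
        zero = np.zeros_like(dx)                 # third channel is all zeros
        frames.append(np.stack([dx, dy, zero]))  # (3, H, W)
    return np.stack(frames, axis=1)              # (3, T, H, W) clip volume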
Proposed study (3.2)
Investigation of different sampling methods
• Phase variation: the length of a gesture can differ from subject to subject
• Solution of the original C3D and two-stream C3D:
– Divide the whole video into non-overlapping 16-frame clips
– Feed each 16-frame clip into C3D to generate a feature vector or a score
– Average the feature vectors / scores of all 16-frame clips for further classification or decision making

Example (23-frame video): duplicate the last frame to pad the second clip
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | 17 18 19 20 21 22 23 23 23 23 23 23 23 23 23 23
(first 16-frame clip | second 16-frame clip)
Proposed study (3.2)
Investigation of different sampling methods
• My solution to avoid the "static effect": build the last clip from the preceding, overlapping frames instead of duplicates
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
(first 16-frame clip | second 16-frame clip)

• Other solutions (see the sketch below):
– Randomly select 16 frames from the video [4]
– Select the 16 most discriminative keyframes [5]
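
A minimal sketch of the padding baseline and the proposed overlapping strategy, assuming `frames` is a list of at least 16 video frames.

def pad_last_frame_clips(frames, clip_len=16):
    """Original strategy: non-overlapping clips; the last clip is padded
    by duplicating the final frame (the source of the 'static effect')."""
    clips = []
    for start in range(0, len(frames), clip_len):
        clip = list(frames[start:start + clip_len])
        clip += [frames[-1]] * (clip_len - len(clip))  # duplicate last frame
        clips.append(clip)
    return clips

def overlapping_last_clip(frames, clip_len=16):
    """Proposed strategy: the last clip steps back and re-uses the
    preceding frames, so every clip contains real motion."""
    clips = []
    for start in range(0, len(frames), clip_len):
        start = min(start, len(frames) - clip_len)  # overlap at the end
        clips.append(list(frames[start:start + clip_len]))
    return clips

# For a 23-frame video this yields clips [1..16] and [8..23], as illustrated above.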

[4] Jing, L., Yang, X., and Tian, Y. (2018). Video you only look once: Overall temporal convolutions for action recognition. Journal of Visual Communication and Image Representation, 52, 58–65.
[5] Van-Ngoc Nguyen, Thanh-Hai Tran, Thi-Lan Le, Van-Toi Nguyen, Thi-Thuy Nguyen (2013). Temporal gesture segmentation for recognition. The International Conference on Computing, Management and Telecommunications (ComManTel 2013), 21–24 January 2013, Ho Chi Minh City, Vietnam.
Page  12
Proposed study (3.3)
Fine-tuning the two-stream C3D
• Fine-tuning the RGB stream
– Initialize RGB-C3D with the C3D model pre-trained on Sports-1M
– Fine-tune
  • only the FC layers of RGB-C3D using the gesture dataset
  • all layers of RGB-C3D using the gesture dataset

Accuracy on Kinect 3 (K3):   FCs only: 64.10%   All layers: 94.00%

=> Fine-tuning all layers of RGB-C3D is used for the evaluation (see the sketch below)
Proposed study (3.3)
Fine-tuning the two-stream C3D
• Fine-tuning the RGB stream (all layers)

Page  14
Proposed study (3.3)
Fine-tuning the two-stream C3D
• Fine-tuning the whole OF-C3D on the optical flow stream using the hand gesture dataset
– Initialize OF-C3D with the C3D model pre-trained on Sports-1M (RGB)
– Initialize OF-C3D with the C3D model pre-trained on UCF101 (optical flow computed from UCF101 [2])

Method (on K5)                    Sports-1M   OF_UCF101
Only OF                           93.28%      89.88%
Combined OF-RGB (early fusion)    99.33%      100%
Combined OF-RGB (late fusion)     93.11%      100%

For OF-C3D alone, pre-training on Sports-1M appears to be better; however, for the two-stream C3D it is better to use UCF101.

Proposed study (3.3)
Fine-tuning the two-stream C3D
• Fine-tuning the whole OF-C3D on the optical flow stream using the hand gesture dataset
– Initialize OF-C3D with the C3D model pre-trained on UCF101 (optical flow computed from UCF101 [2])

Page  16
Proposed study (3.4)
Investigation of the effect of complex background on recognition accuracy
• Comparison with hand-segmented RGB frames
– A separate pre-processing phase extracts the regions of interest before recognition (a sketch follows below)
– If there is a drastic increase in accuracy, we can conclude that the environment does have an effect on recognition accuracy
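
A minimal sketch of this pre-processing step, assuming a hand bounding box is already available for each frame (e.g., from annotation; obtaining the boxes is outside the scope of the sketch).

import numpy as np

def segment_hand(frame, box):
    """Keep only the hand region and zero out the complex background."""
    x, y, w, h = box
    out = np.zeros_like(frame)
    out[y:y + h, x:x + w] = frame[y:y + h, x:x + w]
    return out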
Experimental setup
• Dataset
– 5 gestures: ON_OFF, LEFT, RIGHT, UP, DOWN
– 5 subjects
– 5 views (only K1, K3, K5 are annotated)

[Figure: camera setup with five Kinects (K1–K5) placed around the subject, 45° apart, at a distance of 1.5 m]
Experimental setup (cont'd)
• Dataset: number of samples per gesture and view

View\Gesture   G1   G2   G3   G4   G5
K1             26   22   33   26   23
K3             26   22   33   26   23
K5             26   22   33   26   23
Page  19
Experimental setup (cont'd)
• Evaluation procedure: leave-one-out cross-validation and cross-view evaluation [6]

Data: 5 subjects (P1,..,P5), 3 views (K1, K3, K5), 5 gestures

Leave-one-out:
– Train: view Ki, subjects {P1,..,P5} \ {Pj}
– Test: view Ki, subject Pj

Cross-view:
– Train: view Ki, subjects {P1,..,P5} \ {Pj}
– Test: view Kk (k ≠ i), subject Pj

Evaluation metric: classification accuracy (%)

[6] Doan, H.-G., Vu, H., and Tran, T.-H. (2017). Dynamic hand gesture recognition from cyclical hand pattern. IEEE, 97–100.
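
A minimal sketch of how the two protocols enumerate the train/test splits, with samples indexed by (view, subject).

from itertools import product

VIEWS = ["K1", "K3", "K5"]
SUBJECTS = ["P1", "P2", "P3", "P4", "P5"]

def leave_one_out_splits():
    """Same view for train and test; hold out one subject per split."""
    for view, test_subject in product(VIEWS, SUBJECTS):
        train = [(view, s) for s in SUBJECTS if s != test_subject]
        yield train, [(view, test_subject)]

def cross_view_splits():
    """Train on view Ki, test the held-out subject on a different view Kk."""
    for train_view, test_view, test_subject in product(VIEWS, VIEWS, SUBJECTS):
        if train_view == test_view:
            continue
        train = [(train_view, s) for s in SUBJECTS if s != test_subject]
        yield train, [(test_view, test_subject)]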
Experimental results

Accuracy (%) per input stream; rows = training view, columns = test view:

              RGB                   Segmented RGB         OF                    OF-RGB (early)        OF-RGB (late)
Train\Test  K1     K3     K5      K1     K3     K5      K1     K3     K5      K1     K3     K5      K1     K3     K5
K1          76.27  45.60  50.07   89.97  65.37  54.09   70.98  35.45  39.31   75.67  55.30  47.53   74.10  52.07  45.61
K3          47.76  93.60  76.04   50.67  99.38  89.84   47.63  95.68  71.51   47.72  94.43  81.12   51.59  94.36  80.07
K5          30.41  65.77  96.67   42.49  93.29  99.05   38.49  89.47  93.28   36.70  68.17  100.0   41.55  70.83  100.0
Avg         64.68                 76.01                 64.64                 67.40                 67.79

• Same-view results are quite good (K3, K5). View K1 gives the worst results because the hand is occluded or out of the camera's field of view, so the hand movement is not discriminative.
• Background has a strong impact on the classification results.
• Optical flow gives competitive performance w.r.t. RGB.
• Combining RGB and OF can boost performance.
Page  21
Experimental results (cont'd)

[Figure] Ground truth: DOWN. Prediction: ON_OFF.
This is because the hand is open before it goes down, so C3D confuses the gesture with ON_OFF.
Experimental results (cont'd)

[Figure: input frames and Conv3-layer feature maps]
It is hard to discriminate between ON_OFF, LEFT, and RIGHT at the Conv3 layer. This happens only in the K1 view.
Experimental results (cont'd)

[Figure: input frames and Conv3-layer feature maps for the RGB and optical flow streams]
In this example, incorporating optical flow information helps the network recognize the gesture better at the Conv3 layer.
Experimental results (cont'd)

Accuracy (%) of the three sampling methods; rows = training view, columns = test view:

              16-frame clips [1]    16 randomly selected [4]   16 keyframes [5]
Train\Test  K1     K3     K5      K1     K3     K5           K1     K3     K5
K1          76.27  45.60  50.07   75.59  56.60  41.97        71.15  51.79  33.61
K3          47.76  93.60  76.04   48.78  95.34  77.45        47.98  95.34  78.79
K5          30.41  65.77  96.67   36.15  56.25  97.33        38.31  61.51  96.67
Avg         64.68                 65.05                      63.90

Randomly selecting 16 frames gives the best results while taking the smallest computational time.

Page  25
Conclusions and future work
• Conclusions
– The performance of C3D remains stable under a small viewpoint change (<= 45 degrees)
– Background has a strong impact on classification performance (segmenting the hand increases average accuracy by 11.32%)
– Incorporating an optical flow (OF) channel in a two-stream C3D improves the results (by 3.11%)
• Future work
– Evaluate on the remaining views (K2, K4)
– Compare with an existing method [6] (manifold-based learning)
– Test with automatically segmented hand regions
– Adapt C3D to be more robust to viewpoint change
Thank you for your attention !

Page  27
Appendix: Related works

Molchanov, P., Gupta, S., Kim, K., et al. (2015). Hand gesture recognition with 3D convolutional neural networks.
– Input is interleaved intensity and depth channels
– Trains a high-resolution network (HRN) and a low-resolution network (LRN) to capture both coarse and fine-grained information

Molchanov et al. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks.
– A recurrent neural network (RNN) performs simultaneous detection and classification of dynamic hand gestures
– Connectionist temporal classification (CTC) is used to predict on unsegmented input streams

And many others ...
Appendix: Keyframe sampling
• Keyframe sampling: this strategy finds the frames with the highest L2 distances, i.e., it selects frames I_i and I_j such that

    || I_i − I_j ||_2

is largest. The chosen frames are then sorted temporally (a sketch follows below).

Le, T.L., Nguyen, V.N., Tran, T.T.H., et al. (2013). Temporal gesture segmentation for recognition. 2013 International Conference on Computing, Management and Telecommunications (ComManTel), 369–373.
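
A minimal sketch of keyframe sampling under this criterion. Note that [5] specifies the L2-distance criterion, not the exact procedure; keeping the frames that follow the k largest inter-frame jumps is a simplifying assumption.

import numpy as np

def sample_keyframes(frames, k=16):
    """Select k frames after the largest inter-frame L2 distances,
    returned in temporal order; `frames` are float NumPy arrays."""
    dists = [np.linalg.norm(frames[i + 1].astype(float) - frames[i].astype(float))
             for i in range(len(frames) - 1)]
    top = sorted(range(len(dists)), key=lambda i: dists[i], reverse=True)[:k]
    return [frames[i + 1] for i in sorted(top)]  # sorted temporally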
