Hanoi – 06/2018
Introduction
Hand gesture recognition
Video stream (already temporally segmented) -> dynamic hand gesture recognition -> gesture label
Gesture labels: On/Off, Back, Next, Increase, Decrease, ...
Challenges:
- Hand: low resolution, deformable, high degrees of freedom
- Camera: viewpoint change
- Subject: different styles of performing the same gesture (including phase variation)
- Environment: complex background, illumination variation
Motivation
Example of viewpoint change and complex background (figure panels: View 1, View 2)
Objectives
Study one advanced method for hand gesture recognition
– 3D convolutional neural network [1] (suitable and efficient for video-based tasks)
– and its extension to two streams [2]
Evaluate the robustness of this method to viewpoint change and complex background
– using a multi-modal, multi-view dataset of dynamic hand gestures, in an application that controls home appliances with hand gestures
[1] Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. "Learning spatiotemporal features with 3D convolutional networks." In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4489-4497.
[2] Khong, Van-Minh, and Thanh-Hai Tran. "Improving human action recognition with two-stream 3D convolutional neural network." In the First International Conference on Multimedia Analysis and Pattern Recognition, Ho Chi Minh City, April 6-7, 2018.
Proposed study (1)
3D convolutional Neural Network (C3D) [1]
Main ideas
– A 2D CNN (convolutional neural network) is suitable for still-image tasks because it captures spatially correlated information within an image
Example: ImageNet, ...
– 3D convolution kernels exploit temporal patterns in addition to spatial information, eliminating the need for a secondary temporal modeling technique => the C3D net uses 3D convolutions
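To make the difference between 2D and 3D kernels concrete, the sketch below applies a naive 3D convolution (computed as cross-correlation, as is usual in deep learning) to two tiny clips. The helper names and the temporal-difference kernel are illustrative assumptions, not part of C3D itself; the point is only that a kernel spanning several frames responds to change over time, which a per-frame 2D kernel cannot see.

```python
def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D cross-correlation over nested lists indexed [time][row][col]."""
    frames, height, width = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    output = []
    for t in range(frames - kt + 1):
        plane = []
        for y in range(height - kh + 1):
            row = []
            for x in range(width - kw + 1):
                total = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            total += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(total)
            plane.append(row)
        output.append(plane)
    return output


def constant_frame(value, size=3):
    """Hypothetical helper: a size x size frame filled with one value."""
    return [[float(value)] * size for _ in range(size)]


# Illustrative 3x3x3 kernel: subtracts the first frame from the third one.
TEMPORAL_DIFFERENCE = [constant_frame(-1), constant_frame(0), constant_frame(1)]

static_clip = [constant_frame(1) for _ in range(4)]       # nothing changes over time
brightening_clip = [constant_frame(t) for t in range(4)]  # intensity grows frame by frame

static_response = conv3d_valid(static_clip, TEMPORAL_DIFFERENCE)
motion_response = conv3d_valid(brightening_clip, TEMPORAL_DIFFERENCE)
print(static_response)  # all zeros: no temporal change to detect
print(motion_response)  # non-zero: the kernel picks up the change across frames
```

Every frame of each clip is spatially uniform, so the two clips differ only in how they evolve over time, and only a kernel with temporal extent can tell them apart.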
Proposed study (1)
3D convolutional Neural Network (C3D) [1]
C3D architecture
– 8 conv. layers, 5 pooling layers, 2 FC layers
– Kernel size: 3x3x3; pooling size: 2x2x2 for all pooling layers except the first, which is 1x2x2
– Number of feature maps per convolutional layer: 64, 128, 256, 256, 512, 512, 512, 512
– Input: a 16-frame clip, each frame of size 128 x 171
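The layer list above can be turned into a small shape trace to see how the clip shrinks through the network. This is a sketch under stated assumptions: the position of the pooling layers (after conv layers 1, 2, 4, 6 and 8) follows the layout described in the C3D paper [1], convolutions are assumed to preserve shape (stride 1, padding 1), and pooling is assumed to round down. The function name is hypothetical.

```python
# Feature maps per conv layer, as listed on the slide.
CONV_CHANNELS = [64, 128, 256, 256, 512, 512, 512, 512]

# Assumption (layout from the C3D paper [1]): a pooling layer follows conv
# layers 1, 2, 4, 6 and 8; only the first pooling keeps the temporal size.
POOL_AFTER = {1: (1, 2, 2), 2: (2, 2, 2), 4: (2, 2, 2), 6: (2, 2, 2), 8: (2, 2, 2)}


def trace_c3d_shapes(frames=16, height=128, width=171):
    """Return (layer name, (channels, T, H, W)) for each conv and pooling layer.

    Assumes shape-preserving 3x3x3 convolutions and pooling whose stride
    equals its window, with sizes rounded down.
    """
    shapes = []
    t, h, w = frames, height, width
    pool_count = 0
    for index, channels in enumerate(CONV_CHANNELS, start=1):
        shapes.append((f"conv{index}", (channels, t, h, w)))
        if index in POOL_AFTER:
            pt, ph, pw = POOL_AFTER[index]
            t, h, w = t // pt, h // ph, w // pw
            pool_count += 1
            shapes.append((f"pool{pool_count}", (channels, t, h, w)))
    return shapes


for name, shape in trace_c3d_shapes():
    print(f"{name:6s} {shape}")
```

Because the first pooling window is 1x2x2, all 16 frames survive the first stage, so temporal information is not merged too early; the later 2x2x2 poolings halve the temporal size each time.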
Proposed study (2)
Two stream C3D [2]
Main ideas:
– C3D uses only RGB data
– Optical flow or another stream can provide additional information for recognition => two-stream C3D (RGB + Optical Flow)
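One simple way to combine the two streams is late fusion: each stream classifies the clip on its own, and the per-class probabilities are then averaged. The slides do not specify the fusion rule used in [2], so the sketch below is only an assumption; the function names, the equal weighting and the example scores are all hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_scores, flow_scores, rgb_weight=0.5):
    """Weighted average of the class probabilities of the two streams.

    rgb_weight is a hypothetical tuning parameter; 0.5 means plain averaging.
    Returns the predicted class index and the fused probabilities.
    """
    if len(rgb_scores) != len(flow_scores):
        raise ValueError("both streams must score the same set of classes")
    rgb_probs = softmax(rgb_scores)
    flow_probs = softmax(flow_scores)
    fused = [rgb_weight * r + (1.0 - rgb_weight) * f
             for r, f in zip(rgb_probs, flow_probs)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


# Made-up scores for three classes: RGB slightly prefers class 0,
# the optical flow stream clearly prefers class 1.
label, fused = late_fusion([2.0, 1.9, 0.1], [0.2, 3.0, 0.1])
print(label, [round(p, 3) for p in fused])
```

In this made-up example the confident optical flow stream overrides the marginal RGB preference, which is the kind of complementarity a second stream is meant to provide.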
Two stream C3D architecture:
Main ideas:
– Pre-processing of hand gesture data
– Investigation of different methods for the phase variation problem
– Fine-tuning the parameters of the two-stream C3D on a human gesture dataset under different viewpoints
– Investigation of the effects of complex background on recognition accuracy
Proposed study (3.1)
Pre-processing hand gesture data
Manual spotting of gestures from the stream
Computing the Optical Flow stream based on [3]
– Optical flow characterizes the movement of pixels between consecutive images
Figure: two consecutive frames I(u, v, ·) and the two optical flow fields d^x(u, v) and d^y(u, v) in the vertical and horizontal dimensions
Proposed study (3.2)
Investigation of different sampling methods
My solution to avoid the "static effect": take previous overlapped frames
Example for a 23-frame video with 16-frame clips:
Frames: 1 2 3 ... 23
Clip 1: frames 1 to 16
Clip 2: frames 8 to 23 (overlaps clip 1 on frames 8 to 16)
Other solutions
– Randomly select 16 frames from the video [4]
– Select 16 most discriminative keyframes [5]
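The first two sampling strategies can be sketched in a few lines. The overlapped variant below reproduces the 23-frame example: the final clip is shifted back so that it ends on the last frame and reuses frames from the previous clip. Function names are hypothetical, and two details are assumptions not stated on the slide: videos shorter than one clip are rejected, and randomly sampled frames are kept in temporal order.

```python
import random

CLIP_LENGTH = 16  # C3D takes 16-frame clips


def overlapped_clips(num_frames, clip_length=CLIP_LENGTH):
    """Split frames 1..num_frames into full clips; the last clip is shifted
    back to end on the final frame, overlapping the previous clip."""
    if num_frames < clip_length:
        # Assumption: shorter videos are outside the scope of this sketch.
        raise ValueError("video is shorter than one clip")
    starts = list(range(1, num_frames - clip_length + 2, clip_length))
    if starts[-1] + clip_length - 1 < num_frames:
        starts.append(num_frames - clip_length + 1)
    return [list(range(start, start + clip_length)) for start in starts]


def random_frames(num_frames, clip_length=CLIP_LENGTH, seed=None):
    """Pick clip_length distinct frames at random, kept in temporal order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, num_frames + 1), clip_length))


clips = overlapped_clips(23)
print(clips[0])  # frames 1..16
print(clips[1])  # frames 8..23
print(random_frames(23, seed=0))
```

Shifting the last clip back means every clip contains only real frames, at the cost of seeing frames 8 to 16 twice in the 23-frame example.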
[4] Jing L., Yang X., and Tian Y. (2018). Video you only look once: Overall temporal convolutions for action recognition. J Vis Commun Image Represent, 52, 58–65.
[5] Van-Ngoc Nguyen, Thanh-Hai Tran, Thi-Lan Le, Van-Toi Nguyen, Thi-Thuy Nguyen. Temporal gesture segmentation for recognition. The International Conference on Computing, Management and Telecommunications (ComManTel 2013), 21-24 January 2013, Ho Chi Minh City, Vietnam.
Proposed study (3.3)
Fine tuning two streams C3D
Fine tuning on RGB stream
– Initialize RGB-C3D with the C3D model pre-trained on Sports-1M
– Fine-tune either
• several layers of RGB-C3D using the gesture dataset, or
• all layers of RGB-C3D using the gesture dataset
Kinect 3
Proposed study (3.3)
Fine tuning two streams C3D
Fine tuning on Optical Flow stream: fine-tune the whole OF-C3D using the hand gesture dataset
– Initialize OF-C3D with the C3D model pre-trained on Sports-1M (RGB)
– Initialize OF-C3D with the C3D model pre-trained on UCF (optical flow computed from UCF101 [2])
Method (on K5): Sports-1M, OF_UCF101
Proposed study (3.3)
Investigation of the effects of complex background
on recognition accuracy
Comparison with hand-segmented RGB frames
Experimental setup
Figure: camera setup with five Kinect sensors K1 to K5; labels show a distance of 1.5 m and an angle of 45°
Experimental setup (cont’d)
Datasets
– 5 gestures: ON_OFF, LEFT, RIGHT, UP, DOWN
– 5 subjects
– 5 views (only K1, K3, K5 are annotated)
View\Gesture  G1  G2  G3  G4  G5
K1            26  22  33  26  23
K3            26  22  33  26  23
K5            26  22  33  26  23
Experimental setup (cont’d)
Evaluation procedure: leave-one-out cross validation and cross-view [6]
Data: 5 subjects (P1,..P5), 3 views (K1, K3, K5), 5 gestures
Leave-one-out:
- Train: View Ki, subjects {P1,..P5} \ {Pj}
- Test: View Ki, subject Pj
Cross-view:
- Train: View Ki, subjects {P1,..P5} \ {Pj}
- Test: View Kk, subject Pj
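The two protocols differ only in whether the test view equals the training view, so both can be enumerated with one loop. The sketch below is a hypothetical helper, not the authors' evaluation code; it assumes that cross-view means the test view differs from the training view, and it only builds the split descriptions, leaving data loading aside.

```python
from itertools import product

VIEWS = ["K1", "K3", "K5"]
SUBJECTS = ["P1", "P2", "P3", "P4", "P5"]


def make_splits(views=VIEWS, subjects=SUBJECTS):
    """Enumerate all train/test splits: train on one view with one subject
    held out, test on that subject in the same or in another view."""
    splits = []
    for train_view, test_view, held_out in product(views, views, subjects):
        splits.append({
            # Assumption: "cross-view" means the test view differs from the training view.
            "protocol": "leave-one-out" if train_view == test_view else "cross-view",
            "train_view": train_view,
            "train_subjects": [s for s in subjects if s != held_out],
            "test_view": test_view,
            "test_subject": held_out,
        })
    return splits


splits = make_splits()
same_view = [s for s in splits if s["protocol"] == "leave-one-out"]
cross_view = [s for s in splits if s["protocol"] == "cross-view"]
print(len(splits), len(same_view), len(cross_view))
print(splits[0])
```

In both protocols the test subject never appears in the training set, so the reported numbers reflect unseen performers as well as, for cross-view, an unseen camera position.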
Evaluation metric:
[6] Doan H.-G., Vu H., and Tran T.-H. (2017). Dynamic hand gesture recognition from cyclical hand pattern. IEEE, 97–100.
Experimental results
Train | RGB                 | Segmented RGB       | OF                  | OF-RGB early        | OF-RGB late
Test  | K1     K3     K5    | K1     K3     K5    | K1     K3     K5    | K1     K3     K5    | K1     K3     K5
K1    | 76.27  45.60  50.07 | 89.97  65.37  54.09 | 70.98  35.45  39.31 | 75.67  55.30  47.53 | 74.10  52.07  45.61
K3    | 47.76  93.60  76.04 | 50.67  99.38  89.84 | 47.63  95.68  71.51 | 47.72  94.43  81.12 | 51.59  94.36  80.07
K5    | 30.41  65.77  96.67 | 42.49  93.29  99.05 | 38.49  89.47  93.28 | 36.70  68.17  100.0 | 41.55  70.83  100.0
Training and testing on the same view gives quite good results for K3 and K5. View K1 gives the worst results because the hands are occluded or outside the camera's field of view, so the hand movement is not discriminative.
Background has a strong impact on the classification result.
Optical Flow gives performance competitive with RGB.
Combining RGB and OF can boost performance.
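The observations above can be checked by averaging the table. The sketch below copies the accuracies from the results table and computes, per input type, the mean over all nine view pairs as well as the same-view and cross-view means. The function names are hypothetical, and averaging over all nine cells is an assumption about how the summary figures were obtained; under that assumption the gains over raw RGB come out at roughly 11.3 points for hand-segmented RGB and 3.11 points for late fusion, in line with the figures quoted in the conclusions.

```python
# Accuracies (%) copied from the results table; each inner list is one table row
# (K1, K3, K5) and each entry one table column (K1, K3, K5).
ACCURACY = {
    "RGB":           [[76.27, 45.60, 50.07], [47.76, 93.60, 76.04], [30.41, 65.77, 96.67]],
    "Segmented RGB": [[89.97, 65.37, 54.09], [50.67, 99.38, 89.84], [42.49, 93.29, 99.05]],
    "OF":            [[70.98, 35.45, 39.31], [47.63, 95.68, 71.51], [38.49, 89.47, 93.28]],
    "OF-RGB early":  [[75.67, 55.30, 47.53], [47.72, 94.43, 81.12], [36.70, 68.17, 100.0]],
    "OF-RGB late":   [[74.10, 52.07, 45.61], [51.59, 94.36, 80.07], [41.55, 70.83, 100.0]],
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def overall_mean(matrix):
    """Mean over all nine train/test view pairs."""
    return mean(cell for row in matrix for cell in row)


def same_view_mean(matrix):
    """Mean over the diagonal (training and test view identical)."""
    return mean(matrix[i][i] for i in range(len(matrix)))


def cross_view_mean(matrix):
    """Mean over the off-diagonal cells (training and test view differ)."""
    return mean(matrix[i][j] for i in range(len(matrix))
                for j in range(len(matrix)) if i != j)


baseline = overall_mean(ACCURACY["RGB"])
for name, matrix in ACCURACY.items():
    print(f"{name:14s} all={overall_mean(matrix):6.2f} "
          f"same-view={same_view_mean(matrix):6.2f} "
          f"cross-view={cross_view_mean(matrix):6.2f} "
          f"gain over RGB={overall_mean(matrix) - baseline:+.2f}")
```

Splitting the mean into same-view and cross-view parts also makes the viewpoint sensitivity visible: for every input type the same-view mean is far higher than the cross-view mean.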
Experimental results (cont’d)
Figure: input and Conv3 layer panels (test views K1, K3, K5)
Conclusions and future works
Conclusions
– The performance of C3D remains stable under a small change of viewpoint (<= 45 degrees)
– Background has a strong impact on classification performance (accuracy increased by 11.32% with hand-segmented RGB frames)
– Incorporating an Optical Flow (OF) channel in a two-stream C3D gives improved results (accuracy increased by 3.11%)
Future works
– Evaluate on the remaining views (K2, K4)
– Compare with the existing method [6] (manifold-based learning)
– Test with automatically segmented hand regions
– Adapt C3D to be more robust to viewpoint change
Thank you for your attention!
Appendix: Related works
Molchanov et al., Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks
Molchanov P., Gupta S., Kim K., et al. (2015). Hand gesture recognition with 3D convolutional neural networks.
Le T.L., Nguyen V.N., Tran T.T.H., et al. (2013). Temporal gesture segmentation for recognition. 2013 International Conference on Computing, Management and Telecommunications (ComManTel), 369–373.