
Hand gesture recognition using 3D Convolutional Neural Network under different viewpoints

Student: Manh-Truong Dang
Student ID: 20134209 – K58 PFIEV SIC
Supervisor: Dr. Thanh-Hai Tran

Hanoi – 06/2018
Introduction
• Hand gesture recognition
[Diagram] Video stream (already temporally segmented) → Dynamic hand gesture recognition → Gesture label: On/Off, Back, Next, Increase, Decrease, ...

• Challenges:
– Hand: low resolution, deformable, high degrees of freedom
– Camera: viewpoint change
– Subject: different styles of performing the same gesture (including phase variation)
– Environment: complex background, illumination variation
Motivation
• Example of viewpoint change and complex background

[Figure: the same gesture captured from View 1 and View 2]

Hand gesture recognition should be robust to viewpoint changes and complex backgrounds in order to be applicable in practical situations. However, no existing work has evaluated in detail the robustness of recognition methods to such variations.
Objectives
• Study one advanced method for hand gesture recognition
– 3D convolutional neural network [1] (suitable and efficient for video-based tasks)
– and its extension to two streams [2]
• Evaluate the robustness of this method to viewpoint change and complex background
– working on a multi-modal multi-view dataset of dynamic hand gestures, in an application of controlling home appliances with hand gestures
[1] Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. 2015 IEEE International Conference on Computer Vision (ICCV), 4489–4497.

[2] Van-Minh Khong, Thanh-Hai Tran (2018). Improving human action recognition with two-stream 3D convolutional neural network. The First International Conference on Multimedia Analysis and Pattern Recognition, Ho Chi Minh City, April 6–7, 2018.
Page  5
Proposed study (1)
3D convolutional neural network (C3D) [1]
• Main ideas
– A 2D CNN (convolutional neural network) is suited to still-image tasks, as it captures the spatial correlations within an image (e.g., ImageNet classification)
– 3D convolution kernels can exploit temporal patterns in addition to spatial information, eliminating the need for separate temporal modeling techniques => the C3D network uses 3D convolutions

Page  6
Proposed study (1)
3D convolutional neural network (C3D) [1]
• C3D architecture
– 8 convolutional layers, 5 pooling layers, 2 fully connected (FC) layers
– Kernel size: 3x3x3; all pooling sizes 2x2x2, except the first pooling, which is 1x2x2
– Number of feature maps in the convolutional layers: 64, 128, 256, 256, 512, 512, 512, 512
– Input: a 16-frame clip, each frame of size 128 x 171
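
A minimal PyTorch sketch of this architecture, for illustration only (not the authors' released model): layer sizes follow the list above and [1], the 112 x 112 input crop is the training crop from [1], and num_classes=5 assumes the gesture set used later in this work.

import torch
import torch.nn as nn

class C3D(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        def conv(c_in, c_out):
            # 3x3x3 convolution with padding, followed by ReLU
            return [nn.Conv3d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU()]
        self.features = nn.Sequential(
            *conv(3, 64),    nn.MaxPool3d((1, 2, 2)),    # pool1: spatial only
            *conv(64, 128),  nn.MaxPool3d((2, 2, 2)),    # pool2
            *conv(128, 256), *conv(256, 256), nn.MaxPool3d((2, 2, 2)),  # pool3
            *conv(256, 512), *conv(512, 512), nn.MaxPool3d((2, 2, 2)),  # pool4
            *conv(512, 512), *conv(512, 512),
            nn.MaxPool3d((2, 2, 2), padding=(0, 1, 1)),  # pool5, padded as in [1]
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 1 * 4 * 4, 4096), nn.ReLU(), nn.Dropout(0.5),  # fc6
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),             # fc7
            nn.Linear(4096, num_classes),                                  # output layer
        )

    def forward(self, x):  # x: (batch, 3, 16, 112, 112)
        return self.classifier(self.features(x))

# Quick shape check on a random 16-frame clip
out = C3D()(torch.randn(1, 3, 16, 112, 112))  # -> (1, 5)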

Page  7
Proposed study (2)
Two-stream C3D [2]
• Main ideas:
– C3D uses only RGB data
– Optical flow (or another stream) can provide additional information for recognition => two-stream C3D (RGB + optical flow)
• Two-stream C3D architecture:
– Early fusion: FC-6 features are L2-normalized, concatenated, and then classified by an SVM
– Late fusion: the predictions of both streams (prob layer) are averaged to form the final prediction

Both [1] and [2] address only human action recognition (UCF101, HMDB51); there is no evaluation of C3D on hand gestures under different viewpoints.
Proposed study (3)
Deploying [1] and [2] for hand gesture recognition
• Main ideas:
– Pre-processing of the hand gesture data
– Investigation of different methods for the phase variation problem
– Fine-tuning the parameters of the two-stream C3D on a hand gesture dataset with different viewpoints
– Investigation of the effect of complex background on recognition accuracy
Proposed study (3.1)
Pre-processing hand gesture data
• Manual spotting of gestures from the stream
• Computing the optical flow stream based on [3]
– Optical flow characterizes the movement of pixels between consecutive images. The horizontal and vertical flow fields are interleaved into the input volume: I(u, v, 2k−1) = d_x^{k−1}(u, v), I(u, v, 2k) = d_y^{k−1}(u, v)

[Figure: two consecutive frames and the two optical flow fields in the vertical and horizontal directions]

• Stack the optical flow as a 3D volume of (dx, dy, 0) frames (a sketch follows below)

[3] http://www.robots.ox.ac.uk/~vgg/software/two_stream_action/
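
A minimal sketch of building such an input volume, assuming `flows` is a list of (dx, dy) flow-field pairs, each an H x W float array, e.g. computed with the tool from [3].

import numpy as np

def flow_clip(flows):
    """Each flow pair becomes a 3-channel frame (dx, dy, 0), so the
    OF stream can reuse the 3-channel C3D input layout."""
    frames = []
    for dx, dy in flows:
        zero = np.zeros_like(dx)                 # third channel is all zeros
        frames.append(np.stack([dx, dy, zero]))  # (3, H, W)
    return np.stack(frames, axis=1)              # (3, T, H, W) clip volume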
Proposed study (3.2)
Investigation of different sampling methods
• Phase variation: the length of a gesture can differ from subject to subject
• Solution of the original C3D and two-stream C3D:
– Divide the whole video into non-overlapping 16-frame clips
– Feed each 16-frame clip into C3D to generate a feature vector or a score
– Average the feature vectors / scores of all 16-frame clips for further classification or decision making

Example (23-frame video): duplicate the last frame to pad the second clip
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | 17 18 19 20 21 22 23 23 23 23 23 23 23 23 23 23
(first 16-frame clip | second 16-frame clip)
Proposed study (3.2)
Investigation of different sampling methods
• My solution to avoid the "static effect": build the last clip from the preceding, overlapping frames instead of duplicates
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
(first 16-frame clip | second 16-frame clip)

• Other solutions (see the sketch below):
– Randomly select 16 frames from the video [4]
– Select the 16 most discriminative keyframes [5]
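
A minimal sketch of the padding baseline and the proposed overlapping strategy, assuming `frames` is a list of at least 16 video frames.

def pad_last_frame_clips(frames, clip_len=16):
    """Original strategy: non-overlapping clips; the last clip is padded
    by duplicating the final frame (the source of the 'static effect')."""
    clips = []
    for start in range(0, len(frames), clip_len):
        clip = list(frames[start:start + clip_len])
        clip += [frames[-1]] * (clip_len - len(clip))  # duplicate last frame
        clips.append(clip)
    return clips

def overlapping_last_clip(frames, clip_len=16):
    """Proposed strategy: the last clip steps back and re-uses the
    preceding frames, so every clip contains real motion."""
    clips = []
    for start in range(0, len(frames), clip_len):
        start = min(start, len(frames) - clip_len)  # overlap at the end
        clips.append(list(frames[start:start + clip_len]))
    return clips

# For a 23-frame video this yields clips [1..16] and [8..23], as illustrated above.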

[4] Jing, L., Yang, X., and Tian, Y. (2018). Video you only look once: Overall temporal convolutions for action recognition. Journal of Visual Communication and Image Representation, 52, 58–65.
[5] Van-Ngoc Nguyen, Thanh-Hai Tran, Thi-Lan Le, Van-Toi Nguyen, Thi-Thuy Nguyen (2013). Temporal gesture segmentation for recognition. The International Conference on Computing, Management and Telecommunications (ComManTel 2013), 21–24 January 2013, Ho Chi Minh City, Vietnam.
Page  12
Proposed study (3.3)
Fine-tuning the two-stream C3D
• Fine-tuning the RGB stream
– Initialize RGB-C3D with the C3D model pre-trained on Sports-1M
– Fine-tune
  • only the FC layers of RGB-C3D using the gesture dataset
  • all layers of RGB-C3D using the gesture dataset

Accuracy on Kinect 3 (K3):   FCs only: 64.10%   All layers: 94.00%

=> Fine-tuning all layers of RGB-C3D is used for the evaluation (see the sketch below)
Proposed study (3.3)
Fine-tuning the two-stream C3D
• Fine-tuning the RGB stream (all layers)

Page  14
Proposed study (3.3)
Fine-tuning the two-stream C3D
• Fine-tuning the whole OF-C3D on the optical flow stream using the hand gesture dataset
– Initialize OF-C3D with the C3D model pre-trained on Sports-1M (RGB)
– Initialize OF-C3D with the C3D model pre-trained on UCF101 (optical flow computed from UCF101 [2])

Method (on K5)                    Sports-1M   OF_UCF101
Only OF                           93.28%      89.88%
Combined OF-RGB (early fusion)    99.33%      100%
Combined OF-RGB (late fusion)     93.11%      100%

For OF-C3D alone, pre-training on Sports-1M appears to be better; however, for the two-stream C3D it is better to use UCF101.

Proposed study (3.3)
Fine-tuning the two-stream C3D
• Fine-tuning the whole OF-C3D on the optical flow stream using the hand gesture dataset
– Initialize OF-C3D with the C3D model pre-trained on UCF101 (optical flow computed from UCF101 [2])

Page  16
Proposed study (3.4)
Investigation of the effect of complex background on recognition accuracy
• Comparison with hand-segmented RGB frames
– A separate pre-processing phase extracts the regions of interest before recognition (a sketch follows below)
– If there is a drastic increase in accuracy, we can conclude that the environment does have an effect on recognition accuracy
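
A minimal sketch of this pre-processing step, assuming a hand bounding box is already available for each frame (e.g., from annotation; obtaining the boxes is outside the scope of the sketch).

import numpy as np

def segment_hand(frame, box):
    """Keep only the hand region and zero out the complex background."""
    x, y, w, h = box
    out = np.zeros_like(frame)
    out[y:y + h, x:x + w] = frame[y:y + h, x:x + w]
    return out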
Experimental setup
• Dataset
– 5 gestures: ON_OFF, LEFT, RIGHT, UP, DOWN
– 5 subjects
– 5 views (only K1, K3, K5 are annotated)

[Figure: camera setup with five Kinects (K1–K5) placed around the subject, 45° apart, at a distance of 1.5 m]
Experimental setup (cont'd)
• Dataset: number of samples per gesture and view

View\Gesture   G1   G2   G3   G4   G5
K1             26   22   33   26   23
K3             26   22   33   26   23
K5             26   22   33   26   23
Page  19
Experimental setup (cont'd)
• Evaluation procedure: leave-one-out cross-validation and cross-view evaluation [6]

Data: 5 subjects (P1,..,P5), 3 views (K1, K3, K5), 5 gestures

Leave-one-out:
– Train: view Ki, subjects {P1,..,P5} \ {Pj}
– Test: view Ki, subject Pj

Cross-view:
– Train: view Ki, subjects {P1,..,P5} \ {Pj}
– Test: view Kk (k ≠ i), subject Pj

Evaluation metric: classification accuracy (%)

[6] Doan, H.-G., Vu, H., and Tran, T.-H. (2017). Dynamic hand gesture recognition from cyclical hand pattern. IEEE, 97–100.
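
A minimal sketch of how the two protocols enumerate the train/test splits, with samples indexed by (view, subject).

from itertools import product

VIEWS = ["K1", "K3", "K5"]
SUBJECTS = ["P1", "P2", "P3", "P4", "P5"]

def leave_one_out_splits():
    """Same view for train and test; hold out one subject per split."""
    for view, test_subject in product(VIEWS, SUBJECTS):
        train = [(view, s) for s in SUBJECTS if s != test_subject]
        yield train, [(view, test_subject)]

def cross_view_splits():
    """Train on view Ki, test the held-out subject on a different view Kk."""
    for train_view, test_view, test_subject in product(VIEWS, VIEWS, SUBJECTS):
        if train_view == test_view:
            continue
        train = [(train_view, s) for s in SUBJECTS if s != test_subject]
        yield train, [(test_view, test_subject)]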
Experimental results

Accuracy (%) per input stream; rows = training view, columns = test view:

              RGB                   Segmented RGB         OF                    OF-RGB (early)        OF-RGB (late)
Train\Test  K1     K3     K5      K1     K3     K5      K1     K3     K5      K1     K3     K5      K1     K3     K5
K1          76.27  45.60  50.07   89.97  65.37  54.09   70.98  35.45  39.31   75.67  55.30  47.53   74.10  52.07  45.61
K3          47.76  93.60  76.04   50.67  99.38  89.84   47.63  95.68  71.51   47.72  94.43  81.12   51.59  94.36  80.07
K5          30.41  65.77  96.67   42.49  93.29  99.05   38.49  89.47  93.28   36.70  68.17  100.0   41.55  70.83  100.0
Avg         64.68                 76.01                 64.64                 67.40                 67.79

• Same-view results are quite good (K3, K5). View K1 gives the worst results because the hand is occluded or out of the camera's field of view, so the hand movement is not discriminative.
• Background has a strong impact on the classification results.
• Optical flow gives competitive performance w.r.t. RGB.
• Combining RGB and OF can boost performance.
Page  21
Experimental results (cont'd)

[Figure] Ground truth: DOWN. Prediction: ON_OFF.
This is because the hand is open before it goes down, so C3D confuses the gesture with ON_OFF.
Experimental results (cont'd)

[Figure: input frames and Conv3-layer feature maps]
It is hard to discriminate between ON_OFF, LEFT, and RIGHT at the Conv3 layer. This happens only in the K1 view.
Experimental results (cont'd)

[Figure: input frames and Conv3-layer feature maps for the RGB and optical flow streams]
In this example, incorporating optical flow information helps the network recognize the gesture better at the Conv3 layer.
Experimental results (cont'd)

Accuracy (%) of the three sampling methods; rows = training view, columns = test view:

              16-frame clips [1]    16 randomly selected [4]   16 keyframes [5]
Train\Test  K1     K3     K5      K1     K3     K5           K1     K3     K5
K1          76.27  45.60  50.07   75.59  56.60  41.97        71.15  51.79  33.61
K3          47.76  93.60  76.04   48.78  95.34  77.45        47.98  95.34  78.79
K5          30.41  65.77  96.67   36.15  56.25  97.33        38.31  61.51  96.67
Avg         64.68                 65.05                      63.90

Randomly selecting 16 frames gives the best results while taking the smallest computational time.

Page  25
Conclusions and future work
• Conclusions
– The performance of C3D remains stable under a small viewpoint change (<= 45 degrees)
– Background has a strong impact on classification performance (segmenting the hand increases average accuracy by 11.32%)
– Incorporating an optical flow (OF) channel in a two-stream C3D improves the results (by 3.11%)
• Future work
– Evaluate on the remaining views (K2, K4)
– Compare with an existing method [6] (manifold-based learning)
– Test with automatically segmented hand regions
– Adapt C3D to be more robust to viewpoint change
Thank you for your attention !

Page  27
Appendix: Related works

Molchanov, P., Gupta, S., Kim, K., et al. (2015). Hand gesture recognition with 3D convolutional neural networks.
– Input is interleaved intensity and depth channels
– Trains a high-resolution network (HRN) and a low-resolution network (LRN) to capture both coarse and fine-grained information

Molchanov et al. Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks.
– A recurrent neural network (RNN) performs simultaneous detection and classification of dynamic hand gestures
– Connectionist temporal classification (CTC) is used to predict on unsegmented input streams

And many others ...
Appendix: Keyframe sampling
• Keyframe sampling: this strategy finds the frames with the highest L2 distances, i.e., it selects frames I_i and I_j such that

    || I_i − I_j ||_2

is largest. The chosen frames are then sorted temporally (a sketch follows below).

Le, T.L., Nguyen, V.N., Tran, T.T.H., et al. (2013). Temporal gesture segmentation for recognition. 2013 International Conference on Computing, Management and Telecommunications (ComManTel), 369–373.
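
A minimal sketch of keyframe sampling under this criterion. Note that [5] specifies the L2-distance criterion, not the exact procedure; keeping the frames that follow the k largest inter-frame jumps is a simplifying assumption.

import numpy as np

def sample_keyframes(frames, k=16):
    """Select k frames after the largest inter-frame L2 distances,
    returned in temporal order; `frames` are float NumPy arrays."""
    dists = [np.linalg.norm(frames[i + 1].astype(float) - frames[i].astype(float))
             for i in range(len(frames) - 1)]
    top = sorted(range(len(dists)), key=lambda i: dists[i], reverse=True)[:k]
    return [frames[i + 1] for i in sorted(top)]  # sorted temporally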
