Real-Time Human Pose Recognition in Parts From Single Depth Images

Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake
Microsoft Research Cambridge & Xbox Incubation
CVPR 2011 Best Paper
2/26/13
OUTLINE

Introduction
Data
Body Part Inference and Joint Proposals
Experiments
Discussion
Introduction
Robust interactive human body tracking
Applications: gaming, human-computer interaction, security, telepresence, health care
Real-time depth cameras
Some systems achieve high speeds by tracking from frame to frame, but struggle to re-initialize quickly and so are not robust
Our focus: per-frame initialization, to be combined with a tracking algorithm
Introduction
Appropriate tracking algorithms:
Tracking people with twists and exponential maps (CVPR 1998)
Tracking loose-limbed people (CVPR 2004)
Nonlinear body pose estimation from depth images (DAGM 2005)
Real-time hand-tracking with a color glove (ACM 2009)
Real time motion capture using a single time-of-flight camera (CVPR 2010)
Introduction
Inspired by recent object recognition work that divides objects into parts
Object class recognition by unsupervised scale-invariant learning [CVPR 2003]
The layout consistent random field for recognizing and segmenting partially occluded objects [CVPR 2006]
Two key design goals
Computational efficiency
Robustness
Introduction
Depth image
-> segment ->
Dense probabilistic body part labeling, with parts spatially localized near skeletal joints
-> generate ->
3D joint proposals
Introduction
We treat the segmentation into body parts as a per-pixel classification task
Evaluating each pixel separately avoids a combinatorial search over the different body joints
Training data: generate realistic synthetic depth images
Train a deep randomized decision forest classifier
Avoid overfitting by using hundreds of thousands of training images
Introduction
Overfitting
Introduction
For further speed, the classifier can be run in parallel on each pixel on a GPU
Spatial modes of the inferred per-pixel distributions are computed using mean shift, resulting in the 3D joint proposals
What is Mean Shift?

A tool for: finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in R^N
PDF in feature space: color space, scale space, actually any feature space you can conceive
Data -> Non-parametric Density Estimation -> Discrete PDF Representation
Data -> Non-parametric Density GRADIENT Estimation (Mean Shift) -> PDF Analysis
Intuitive Description
Region of interest
Center of mass
Mean Shift vector
Objective: find the densest region
Distribution of identical billiard balls
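The intuitive description above can be sketched in a few lines of Python. This is a minimal illustration with a flat (uniform) window, not the weighted Gaussian variant used later for joint proposals; the function name `mean_shift_mode` and the parameters `radius`, `tol`, and `max_iter` are assumptions made for this example.

```python
import math

def mean_shift_mode(points, start, radius, tol=1e-6, max_iter=100):
    """Follow mean shift vectors from `start` until the window stops moving."""
    center = tuple(start)
    for _ in range(max_iter):
        # Region of interest: all samples within `radius` of the current center.
        window = [p for p in points if math.dist(p, center) <= radius]
        if not window:
            break  # empty window: nothing pulls the center anywhere
        # Center of mass of the samples inside the window.
        mass = tuple(sum(coords) / len(window) for coords in zip(*window))
        # The mean shift vector points from the current center to the center of mass.
        shift = math.dist(mass, center)
        center = mass
        if shift < tol:
            break
    return center

# Four "billiard balls" clustered near the origin and one outlier.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
mode = mean_shift_mode(samples, start=(0.5, 0.5), radius=1.0)
```

The window drifts toward the dense cluster and ignores the outlier, because the outlier never falls inside the region of interest.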
Main contribution
Treat pose estimation as object recognition
Use a novel intermediate body parts representation
Designed to spatially localize joints of interest
Low computational cost and high accuracy
Experiments
(i) synthetic depth training data is an excellent proxy for real data
(ii) scaling up the learning problem with varied synthetic data is important for high accuracy
(iii) our parts-based approach generalizes better than even an oracular exact nearest neighbor
Data
Depth imaging and motion capture data
Pose estimation research has often focused on techniques to overcome a lack of training data
Two problems: color and pose
Depth image
Use real mocap data
Retargeted to a variety of base character models to synthesize a large, varied dataset
Depth camera: 640x480 image at 30 frames per second
Depth cameras > traditional intensity sensors
working in low light levels
giving a calibrated scale estimate
resolving silhouette ambiguities in pose
Motion capture data
capture a large database of motion capture (mocap) of human actions
approximately 500k frames (driving, dancing, kicking, running, navigating menus)
Need not record mocap with variation in:
rotation about the vertical axis, mirroring left-right, scene position, body shape and size, camera pose
all of which can be added in
Motion capture data
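The classifier uses no temporal information
We are interested only in static poses, not motion
Changes in pose from one mocap frame to the next are often so small as to be insignificant
Discard similar, redundant poses using furthest neighbor clustering, where the distance between poses $p_1$ and $p_2$ is $\max_j \lVert p_1^j - p_2^j \rVert_2$, the maximum Euclidean distance over body joints $j$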
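The pose-pruning step can be sketched as a greedy farthest-point selection that uses the maximum per-joint Euclidean distance. This is an illustrative sketch, not the authors' implementation: the names `pose_distance` and `furthest_neighbor_subset`, the choice of the first pose as seed, and the 5 cm threshold in the usage lines are assumptions made for this example.

```python
import math

def pose_distance(p, q):
    """Maximum Euclidean distance over corresponding body joints."""
    return max(math.dist(a, b) for a, b in zip(p, q))

def furthest_neighbor_subset(poses, min_dist):
    """Return indices of kept poses; every kept pair is at least `min_dist` apart."""
    if not poses or min_dist <= 0:
        raise ValueError("need at least one pose and a positive min_dist")
    kept = [0]  # seed with the first pose (arbitrary choice for this sketch)
    # Distance from every pose to its nearest kept pose.
    nearest = [pose_distance(p, poses[0]) for p in poses]
    while True:
        far = max(range(len(poses)), key=nearest.__getitem__)
        if nearest[far] < min_dist:
            break  # every remaining pose is redundant with a kept one
        kept.append(far)
        for i, p in enumerate(poses):
            nearest[i] = min(nearest[i], pose_distance(p, poses[far]))
    return kept

# Three toy poses with two joints each (coordinates in meters).
pose_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose_b = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.02)]  # near-duplicate of pose_a
pose_c = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0)]    # one joint moved by 0.5 m
kept = furthest_neighbor_subset([pose_a, pose_b, pose_c], min_dist=0.05)
```

Using the maximum over joints rather than the sum means a pose counts as different as soon as any single joint moves far enough.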
Motion capture data
It may be necessary to iterate the process of:
motion capture
sampling from our model
training the classifier
testing joint prediction accuracy
Early experiments used the CMU mocap database
Generating synthetic data
Build a randomized rendering pipeline
Sample fully labeled training images from it
Goals: realism and variety
First: randomly sample a set of parameters
Then: use standard computer graphics techniques
Render depth and body part images from texture-mapped 3D meshes
Use Autodesk MotionBuilder
Slight random variation in height and weight gives extra coverage of body shape variation
Body Part Inference and Joint Proposals

Body part labeling
Depth image features
Randomized decision forests
Joint position proposals
Body part labeling
Intermediate body part representation
Body part labels densely cover the body and are color-coded
Some directly localize particular skeletal joints; others fill the gaps
Transforms the problem into one that can readily be solved by efficient classification algorithms
Body part labeling
The parts are specified in a texture map
Body part labeling
31 body parts:
LU/RU/LW/RW head, neck, L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand, LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee, L/R ankle, L/R foot (Left, Right, Upper, loWer)
Depth image features
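$f_\theta(I, x) = d_I\left(x + \frac{u}{d_I(x)}\right) - d_I\left(x + \frac{v}{d_I(x)}\right)$
$d_I(x)$ is the depth at pixel $x$ in image $I$
$\theta = (u, v)$ describes offsets $u$ and $v$
Normalizing the offsets by $1/d_I(x)$ ensures the features are depth invariant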
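A depth comparison feature of this kind can be sketched as follows. The depth image is a plain list of rows, pixels are addressed as (column, row), and probes that leave the image return a large constant `BACKGROUND`; that constant, the rounding of probe positions, and the function names are assumptions made for this example.

```python
BACKGROUND = 1e6  # assumed large depth for probes that fall outside the image

def depth_at(depth, x):
    """Depth at pixel x = (column, row), or BACKGROUND when out of bounds."""
    col, row = int(round(x[0])), int(round(x[1]))
    if 0 <= row < len(depth) and 0 <= col < len(depth[0]):
        return depth[row][col]
    return BACKGROUND

def depth_feature(depth, x, u, v):
    """Difference of two depth probes whose offsets are scaled by 1 / depth(x)."""
    d = depth_at(depth, x)  # assumed positive (x lies on the body)
    probe_u = (x[0] + u[0] / d, x[1] + u[1] / d)
    probe_v = (x[0] + v[0] / d, x[1] + v[1] / d)
    return depth_at(depth, probe_u) - depth_at(depth, probe_v)

# Toy 3x3 depth image in meters.
image = [
    [1.0, 1.5, 1.0],
    [2.0, 2.0, 3.0],
    [2.0, 2.0, 2.0],
]
# At depth 2.0 an offset of 2 pixel-meters becomes a 1 pixel step.
value = depth_feature(image, x=(1, 1), u=(2.0, 0.0), v=(0.0, -2.0))
```

Dividing the offsets by the depth at `x` shrinks the probe pattern for distant bodies, so the same pair of offsets covers roughly the same world-space extent at any distance.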
Individually these features provide only a weak signal
In combination in a decision forest they are sufficient to accurately disambiguate all trained parts
The design of these features was strongly motivated by their computational efficiency
No preprocessing is needed
Read at most 3 image pixels
Perform at most 5 arithmetic operations
Straightforwardly implemented on the GPU
Randomized decision forests
Fast and effective multi-class classifiers
Implemented efficiently on the GPU
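How a forest classifies one pixel can be sketched as below, assuming the standard random forest formulation in which each tree routes the pixel to a leaf and the leaf distributions are averaged. The nested-dictionary tree layout, the key names, and `classify_pixel` are hypothetical choices for this example, not the authors' data structures.

```python
def classify_pixel(forest, feature_fn):
    """Average the leaf distributions reached in every tree of the forest."""
    posterior = {}
    for tree in forest:
        node = tree
        # Descend until a leaf (a node that stores a class distribution).
        while "dist" not in node:
            go_left = feature_fn(node["feature"]) < node["threshold"]
            node = node["left"] if go_left else node["right"]
        for label, p in node["dist"].items():
            posterior[label] = posterior.get(label, 0.0) + p / len(forest)
    return posterior

# Two toy one-split trees over hypothetical features "a" and "b".
tree_1 = {"feature": "a", "threshold": 0.0,
          "left": {"dist": {"hand": 1.0}},
          "right": {"dist": {"hand": 0.25, "head": 0.75}}}
tree_2 = {"feature": "b", "threshold": 0.0,
          "left": {"dist": {"hand": 0.75, "head": 0.25}},
          "right": {"dist": {"head": 1.0}}}
responses = {"a": 0.5, "b": -1.0}  # feature values at the pixel being classified
posterior = classify_pixel([tree_1, tree_2], responses.__getitem__)
```

Each pixel is handled independently of every other pixel, which is what makes the per-pixel evaluation easy to run in parallel.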
Joint position proposals
generate reliable proposals for the positions of 3D skeletal joints
The final output of our algorithm
Used by a tracking algorithm to self-initialize and recover from failure
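A local mode-finding approach based on mean shift with a weighted Gaussian kernel
$f_c(\hat{x}) \propto \sum_{i=1}^{N} w_{ic} \exp\left(-\left\lVert \frac{\hat{x} - \hat{x}_i}{b_c} \right\rVert^2\right)$
$\hat{x}$ is a coordinate in 3D world space
$\hat{x}_i$ is the reprojection of image pixel $x_i$ into world space given depth $d_I(x_i)$
$w_{ic}$ is a pixel weighting
$b_c$ is a learned per-part bandwidth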
Non-Parametric Density Estimation

Assumption: the data points are sampled from an underlying PDF
Data point density implies PDF value!
[Figure: assumed underlying PDF and real data samples]
Parametric Density Estimation

Assumption: the data points are sampled from an underlying PDF
$\mathrm{PDF}(x) = \sum_i c_i \, e^{-\frac{(x - \mu_i)^2}{2\sigma_i^2}}$
[Figure: estimate of the assumed underlying PDF from real data samples]
The pixel weighting $w_{ic}$ considers both the inferred body part probability at the pixel and the world surface area of the pixel
$w_{ic} = P(c \mid I, x_i) \cdot d_I(x_i)^2$
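A weighted Gaussian mean shift over reprojected pixels can be sketched as follows. The pixel weight is taken as part probability times squared depth, as a stand-in for "probability times world surface area"; the function names, the stopping rule, and the toy data are assumptions made for this example.

```python
import math

def pixel_weight(prob, depth):
    """Part probability scaled by squared depth (proxy for world surface area)."""
    return prob * depth ** 2

def weighted_mean_shift(points, weights, start, bandwidth, tol=1e-6, max_iter=100):
    """Climb to a local mode of a weighted Gaussian kernel density in 3D."""
    x = tuple(start)
    for _ in range(max_iter):
        # Kernel response of every sample at the current estimate.
        k = [w * math.exp(-(math.dist(x, p) / bandwidth) ** 2)
             for p, w in zip(points, weights)]
        total = sum(k)
        if total == 0.0:
            break  # all samples are too far away to contribute
        # New estimate: kernel-weighted mean of the samples.
        new_x = tuple(sum(ki * p[d] for ki, p in zip(k, points)) / total
                      for d in range(len(x)))
        moved = math.dist(new_x, x)
        x = new_x
        if moved < tol:
            break
    return x

# Four reprojected pixels (meters) placed symmetrically around (0, 0, 2).
pts = [(-0.01, 0.0, 2.0), (0.01, 0.0, 2.0), (0.0, -0.01, 2.0), (0.0, 0.01, 2.0)]
wts = [pixel_weight(0.9, p[2]) for p in pts]
mode = weighted_mean_shift(pts, wts, start=(0.02, 0.02, 2.0), bandwidth=0.065)
```

Scaling by squared depth compensates for a distant pixel covering more world surface than a near one, so the density estimate does not depend on how far the person stands from the camera.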
The detected modes lie on the surface of the body
Each mode is pushed back into the scene by a learned z offset to produce a final joint position proposal
Parameters set per part by grid search on a validation set of 5000 images (mean values over parts):
Bandwidth $b_c$ = 0.065 m
Probability threshold $\lambda_c$ = 0.14
Z offset = 0.039 m
Experiments
Further results are provided in the supplementary material
Default parameters:
3 trees, 20 deep
300k training images per tree
2000 training example pixels per image
2000 candidate features
50 candidate thresholds per feature
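The role of the candidate features and thresholds can be sketched as below, assuming the usual information-gain criterion for choosing a split at each tree node (the slide lists only the counts). The example data layout and the names `entropy` and `best_split` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(examples, candidates):
    """Pick the (feature, threshold) pair with the largest information gain.

    examples:   list of (feature_values, label), feature_values maps feature -> float
    candidates: list of (feature, threshold) pairs to try
    """
    labels = [label for _, label in examples]
    base = entropy(labels)
    best, best_gain = None, 0.0
    for feature, tau in candidates:
        left = [lab for vals, lab in examples if vals[feature] < tau]
        right = [lab for vals, lab in examples if vals[feature] >= tau]
        if not left or not right:
            continue  # degenerate split: one side is empty
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best_gain:
            best, best_gain = (feature, tau), gain
    return best, best_gain

# Four toy training pixels with a single hypothetical feature "f".
pixels = [({"f": -1.0}, "hand"), ({"f": -0.5}, "hand"),
          ({"f": 0.5}, "head"), ({"f": 1.0}, "head")]
split, gain = best_split(pixels, [("f", -0.75), ("f", 0.0)])
```

With 2000 candidate features and 50 thresholds each, every node would evaluate up to 100,000 such candidate splits, which is why the cheap per-pixel feature matters for training as well as for testing.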
Experiments
Test data
Challenging synthetic and real depth images to evaluate our approach
Synthetic test set: synthesize 5000 depth images
Real test set: 8808 frames of real depth images, 15 different subjects, 7 upper body joint positions
Experiments
Error metric:
Quantify both classification and joint prediction accuracy
Classification accuracy
average of the diagonal of the confusion matrix between the ground truth part label and the most likely inferred part label
Joint prediction accuracy
generate recall-precision curves as a function of confidence threshold
quantify accuracy as average precision per joint
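The classification metric can be sketched as the mean of the diagonal of a row-normalized confusion matrix, that is, the average per-class accuracy. The function name and the toy labels are assumptions made for this example.

```python
def average_per_class_accuracy(truth, predicted):
    """Mean over classes of the fraction of that class's pixels labeled correctly."""
    per_class = []
    for c in sorted(set(truth)):
        idx = [i for i, t in enumerate(truth) if t == c]
        correct = sum(1 for i in idx if predicted[i] == c)
        per_class.append(correct / len(idx))  # diagonal entry for class c
    return sum(per_class) / len(per_class)

truth = ["hand", "hand", "hand", "head"]
predicted = ["hand", "hand", "head", "head"]
score = average_per_class_accuracy(truth, predicted)
```

Averaging per class rather than per pixel keeps large parts from dominating the score: here 3 of 4 pixels are correct, yet the metric is 5/6 because the small "head" class is fully correct.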
Experiments
Error metric:
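The first joint proposal within D meters of the ground truth position is a true positive; other proposals within D meters count as false positives
This penalizes multiple spurious detections near the correct position, which might slow a downstream tracking algorithm
D = 0.1 m below, close to the accuracy of the hand-labeled real test data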
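The true-positive rule for joint proposals can be sketched as follows. Proposals are visited in order of decreasing confidence, which is an assumption about what "first" means here; the function name and the data layout are also hypothetical.

```python
import math

def label_proposals(proposals, truth, max_dist=0.1):
    """Mark each (confidence, position) proposal as true or false positive.

    Only the first proposal within `max_dist` meters of `truth` is a true
    positive; later ones nearby, and all distant ones, are false positives.
    """
    matched = False
    labeled = []
    for conf, pos in sorted(proposals, key=lambda p: p[0], reverse=True):
        hit = (not matched) and math.dist(pos, truth) <= max_dist
        matched = matched or hit
        labeled.append((conf, hit))
    return labeled

# Two proposals near the true joint and one far away (coordinates in meters).
proposals = [(0.8, (0.0, 0.0, 2.02)), (0.9, (0.0, 0.0, 2.05)), (0.3, (1.0, 0.0, 2.0))]
labeled = label_proposals(proposals, truth=(0.0, 0.0, 2.0))
```

Sweeping a confidence threshold over the labeled list yields the recall-precision curve from which average precision is computed.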
Experiments
Comparison with: Real time motion capture using a single time-of-flight camera [CVPR 2010]
Discussion
Accurate proposals for the 3D locations of body joints
Estimated in super real-time from single depth images
Body part recognition as an intermediate representation
A highly varied synthetic training set
Train very deep decision forests using depth-invariant features without overfitting
Future work
Further study of the variability in the source mocap data
Properties of the generative model underlying the synthesis pipeline
A similarly efficient approach that can directly regress joint positions
Remove ambiguities in local pose estimates
Thank you