Sie sind auf Seite 1von 60

Real-Time Human Pose Recognition in Parts from Single Depth Images Cook Jamie Shotton Andrew Fitzgibbon Mat

Toby Sharp Mark Finocchi Richard Moore Alex Kipman Andrew Blake Microsoft Research Cambridge & Xbox Incubation CVPR 2011 Best Paper

2/26/13

2/26/13

OUTLINE

Introduction Data Body Part Inference and Joint Proposals Experiments Discussion

2/26/13

Introduction

Robust interactive human body tracking

gaming, human-computer interaction, security, telepresence, health-care tracking from frame to frame but struggle to re-initialize quickly and so are not robust

Real time depth cameras

Our focus on per-frame initialization + 2/26/13 tracking algorithm

Introduction

appropriate tracking algorithm

Tracking people with twists and exponential maps (CVPR 1998) Tracking loose limbed people (CVPR 2004) Nonlinear body pose estimation from depth images (DAGM
2005)

Real-time hand-tracking with a color glove (ACM 2009) Real time motion capture using a single time-of-flight camera (CVPR 2010)

2/26/13

Introduction

inspired by recent object recognition work that divides objects into parts

Object class recognition by unsupervised scale-invariant learning [CVPR 2003] The layout consistent random field for recognizing and segmenting partially occluded objects [CVPR 2006]

Two key design goals

Computational efficiency robustness

2/26/13

Introduction

Depth Image

segme nt

dense probabilistic body part labeling + spatially localized near skeletal joints

genera te

3D proposal

2/26/13

Introduction

We treat the segmentation into body parts as a per-pixel classification task

Evaluating each pixel separately generate realistic synthetic depth images train a deep randomized decision forest classifier

Training data

2/26/13

avoid overfitting

Introduction

Overfitting

2/26/13

Introduction

For further speed, the classifier can be run in parallel on each pixel on a GPU mean shift resulting in the 3D joint proposals

2/26/13

What is Mean Shift ?


A tool for: Finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in RN PDF in feature space Color space Non Scale space parametric Density Actually any feature space you can Estimation Discrete PDF conceive Representation

Dat a

Non-parametric Density GRADIENT Estimation (Mean Shift)

2/26/13

PDF Analysis

Intuitive Description
Region of interest Center of mass

Mean Shift vector

2/26/13

Objective : Find the densest Distribution of identical billiard balls region

Intuitive Description
Region of interest Center of mass

Mean Shift vector

2/26/13

Objective : Find the densest Distribution of identical billiard balls region

Intuitive Description
Region of interest Center of mass

Mean Shift vector

2/26/13

Objective : Find the densest Distribution of identical billiard balls region

Intuitive Description
Region of interest Center of mass

Mean Shift vector

2/26/13

Objective : Find the densest Distribution of identical billiard balls region

Intuitive Description
Region of interest Center of mass

Mean Shift vector

2/26/13

Objective : Find the densest Distribution of identical billiard balls region

Intuitive Description
Region of interest Center of mass

Mean Shift vector

2/26/13

Objective : Find the densest Distribution of identical billiard balls region

Intuitive Description
Region of interest Center of mass

2/26/13

Objective : Find the densest Distribution of identical billiard balls region

Main contribution

Treat pose estimation as object recognition

using a novel intermediate body parts representation spatially localize joints low computational cost and high accuracy

2/26/13

Experiments

(i) synthetic depth training data is an excellent proxy for real data
(ii) scaling up the learning problem with varied synthetic data is important for high accuracy (iii) our parts-based approach generalizes better than even an oracular exact nearest neighbor

2/26/13

Data

Depth imaging and Motion capture data Pose estimation research

often focused on techniques lack of training data color pose

Two problems on depth image


2/26/13

Depth image

Use real mocap data

Retargetted to a variety of base character models to synthesize a large, varied dataset 640x480 image at 30 frames per second

Depth cameras > Traditional intensity sensors

working in low light levels

giving a calibrated scale estimate


silhouette ambiguities in pose

2/26/13 resolving

Motion capture data

capture a large database of motion capture (mocap) of human actions

approximately 500k frames (driving, dancing, kicking, running, navigating menus)

Need not record mocap with variation in rotation

vertical axis, mirroring left-right, scene position of which can be addedin

body shape and size, camera pose all 2/26/13

Motion capture data

The classifier uses no temporal information

static poses not motion

frame to the next are so small as to be insignificant

using furthest neighbor clustering algorithm where the distance between poses body joints , Pi mean i pose

2/26/13mean j

Motion capture data

necessary to iterate the process of motion capture

sampling from our model training the classifier testing joint prediction accuracy

CMU mocap database

2/26/13

Generating synthetic data

build a randomized rendering pipeline

sample fully labeled training images realism and variety

Goals

2/26/13

Generating synthetic data

First : randomly samples a set of parameters


Then uses standard computer graphics techniques

render depth and body part images from texture mapped 3D meshes slight random variation in height weight give extra coverage of body

Use autodesk motionbulider

and 2/26/13

Generating synthetic data

2/26/13

Body Part Inference and Joint Proposals


Body part labeling Depth image features Randomized decision forests Joint position proposals

2/26/13

Body part labeling

intermediate body part representation

as color-coded Some directly localize particular skeletal joints others fill the gaps

transforms the problem into one that can readily be solved by efficient classification algorithms
2/26/13

Body part labeling

The parts are specified in a texture map

2/26/13

Body part labeling

31 body parts:

LU/RU/LW/RW head, neck, L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand, LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee, L/R ankle, L/R foot (Left, Right, Upper, loWer)

2/26/13

Depth image features

di (x) is the depth at pixel x in image I = (u, v) describe offsets u and v 1/di (x) ensures the features are depth invariant

2/26/13

Depth image features

Individually these features provide only a weak signal

combination in a decision forest

sufficient to accurately disambiguate all trained parts

2/26/13

Depth image features

The design of these features was strongly motivated by their computational efficiency

no preprocessing is needed read at most 3 image pixels at most 5 arithmetic operations straightforwardly implemented on the GPU

2/26/13

Randomized decision forests

Randomized decision forests

fast and effective multi-class classifiers Implemented efficiently on the GPU 1

2/26/13

Randomized decision forests

2/26/13

Randomized decision forests

2/26/13

Joint position proposals

generate reliable proposals for the positions of 3D skeletal joints

the final output of our algorithm used by a tracking algorithm to self initialize and recover from failure

2/26/13

Joint position proposals

A local mode-finding approach based on mean shift with a weighted Gaussian kernel

^x is the reprojection of image pixel xi


i

bc is a learned per-part bandwidth world space given depth dI (xi)

2/26/13

Non-Parametric Density Estimation


Assumption : The data points are sampled from an underlying PDF
Data point density implies PDF value !

Assumed Underlying PDF


2/26/13

Real Data Samples

Non-Parametric Density Estimation

Assumed Underlying PDF


2/26/13

Real Data Samples

Non-Parametric Density ? Estimation

Assumed Underlying PDF


2/26/13

Real Data Samples

Parametric Density Estimation


Assumption : The data points are sampled from an underlying PDF ( x- i ) 2 2 i 2 i i

PDF(x) =

c e

Estima te Assumed Underlying PDF


2/26/13

Real Data Samples

Joint position proposals

Wic considers both the inferred body part probability at the pixel and the world surface area of the pixel

2/26/13

Joint position proposals

The detected modes

lie on the surface of the body pushed back into the scene by a learned z offset produce a final joint position proposal

Bandwidth Bc = 0.065m Threshold c = 0.14 Z offset = 0.039m Set = 5000 images by grid search

2/26/13

Joint position proposals

2/26/13

Experiments

provide further results in the supplementary material

3 trees, 20 deep, 300k training images per tree 2000 training example pixels per image 2000 candidate features 50 candidate thresholds per feature

2/26/13

Experiments

Test data

challenging synthetic and real depth images to evaluate our approach synthesize 5000 depth images 8808 frames of real depth images 15 different subjects 7 upper body joint positions

Real test set


2/26/13

Experiments

Error metric:

quantify both classification

average of the diagonal of the confusion matrix between the ground truth part label

and the most likely inferred part label

Joint prediction accuracy

generate recall-precision curvesas a function of confidence threshold quantify accuracy as average precision per

2/26/13

Experiments

Error metric:

This penalizes multiple spurious detections Near the correct position which might slow a downstream tracking algorithm

D = 0.1 m below closed real test data

2/26/13

Experiments

2/26/13

Experiments

2/26/13

Experiments

2/26/13

Experiments

2/26/13

Experiments

2/26/13

Experiments

Real time motion capture using a single time-of-flight camera. [CVPR 2010]

2/26/13

Discussion

accurate proposals

for the 3D locations of body joints

super real-time from single depth images as an intermediate representation train very deep decision forests invariant features without

body part recognition

a highly varied synthetic training set

2/26/13 Depth

Future work

study of the variability in the source mocap data


Generative model underlying the synthesis pipeline a similarly efficient approach

directly regress joint positions remove ambiguities in local pose

2/26/13

Thank you

2/26/13

Das könnte Ihnen auch gefallen