
Cem Keskin, Furkan Kıraç, Yunus Emre Kara, Lale Akarun

keskinc@cmpe.boun.edu.tr kiracmus@boun.edu.tr yunus.kara@boun.edu.tr akarun@boun.edu.tr



Sections: Single Layered RDF for Hand Pose; Multi Layered RDF for Hand Pose; Unsupervised Clustering of Images; Hand Shape Classification with RDF; The Expert Networks: GEN vs. LEN

PROBLEM STATEMENT
Hand skeleton estimation and hand shape recognition are important for
human-computer interaction and sign language recognition.
Requirements: the method must be real-time, non-invasive, and robust to illumination changes.
Solution: use a depth sensor such as Kinect.
Previous work:
  Oikonomidis et al. demonstrated that the interaction of two articulated
  hands can be tracked in real time [1] -> ~20 fps, requires tracking
  Keskin et al. proposed real-time hand skeleton estimation using
  Randomized Decision Forests (RDF) [2], similar to the body pose
  estimation method of Shotton et al. [3] -> high memory requirements, limited poses
Aim: design a method that has higher accuracy and handles more general hand poses.

SINGLE LAYERED RDF FOR HAND POSE
A Randomized Decision Forest (RDF) classifies each depth pixel into a hand part.
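The poster does not spell out the split test used at each tree node. As a minimal sketch, assuming a depth-comparison feature in the style of Shotton et al. [3] (two probe offsets scaled by the depth at the pixel, which is what makes the feature invariant to depth), a node test could look like the following. The names `depth_feature`, `go_left`, `BACKGROUND`, and the parameters `u`, `v`, `tau` are hypothetical, not taken from the poster.

```python
import numpy as np

# Assumption: probes that leave the image or hit an empty pixel get a large depth.
BACKGROUND = 1e6


def depth_at(depth, row, col):
    """Depth at (row, col), or BACKGROUND outside the image or on empty pixels."""
    height, width = depth.shape
    if 0 <= row < height and 0 <= col < width and depth[row, col] > 0:
        return float(depth[row, col])
    return BACKGROUND


def depth_feature(depth, pixel, u, v):
    """Difference of two depth probes around `pixel`.

    The offsets u and v are divided by the depth at the pixel, so the probe
    pattern shrinks as the hand moves away from the camera.
    """
    row, col = pixel
    d = depth_at(depth, row, col)
    u_row, u_col = int(round(row + u[0] / d)), int(round(col + u[1] / d))
    v_row, v_col = int(round(row + v[0] / d)), int(round(col + v[1] / d))
    return depth_at(depth, u_row, u_col) - depth_at(depth, v_row, v_col)


def go_left(depth, pixel, u, v, tau):
    """Binary split test at a tree node: compare the feature to a threshold."""
    return depth_feature(depth, pixel, u, v) < tau
```

Each tree node would store its own `u`, `v`, and `tau`, chosen during training; a pixel is routed down the tree by repeated calls to `go_left` until it reaches a leaf holding a part label distribution.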
[Figure: decision tree with a split test F at each node. Pipeline panels: 20 DOF Hand Model; 21 Hand Parts; Manually posed; RDF Classification; Mean Shift; Scoring]
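The pipeline lists a mean shift step after RDF classification but gives no details. A minimal sketch, assuming the step finds the mode of the 3D points assigned to one hand part, weighted by their class posterior, with a Gaussian kernel: the function name `mean_shift_mode`, the kernel choice, and the `bandwidth` parameter are all assumptions.

```python
import numpy as np


def mean_shift_mode(points, weights, bandwidth, start=None, iters=50, tol=1e-6):
    """Weighted mean shift with a Gaussian kernel.

    points:  (N, 3) array of 3D pixel positions for one hand part
    weights: (N,) array, e.g. the part posterior of each pixel
    Returns the mode reached from `start` (default: the weighted mean).
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if start is None:
        start = (weights[:, None] * points).sum(axis=0) / weights.sum()
    mode = np.asarray(start, dtype=float)
    for _ in range(iters):
        sq_dist = np.sum((points - mode) ** 2, axis=1)
        kernel = weights * np.exp(-sq_dist / (2.0 * bandwidth ** 2))
        total = kernel.sum()
        if total == 0.0:
            break  # no point within reach of the kernel
        new_mode = (kernel[:, None] * points).sum(axis=0) / total
        converged = np.linalg.norm(new_mode - mode) < tol
        mode = new_mode
        if converged:
            break
    return mode
```

Compared with a plain weighted mean, the mode is far less sensitive to a few misclassified pixels that lie away from the true part location.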
Dataset generation tool: synthesizes realistic depth images and ground truth
labels by animating the hand model between manually posed key frames.
The model is rotated to ensure view independence.
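How the tool animates between key frames is not described on the poster. One simple possibility, shown here purely as an assumption, is linear interpolation of the 20 joint angles between two key poses; `interpolate_poses` is a hypothetical name.

```python
import numpy as np


def interpolate_poses(key_a, key_b, n_frames):
    """Linearly interpolate joint angles between two key poses.

    key_a, key_b: arrays of joint angles (one value per degree of freedom)
    Returns an (n_frames, dof) array whose first row is key_a and last is key_b.
    """
    a = np.asarray(key_a, dtype=float)
    b = np.asarray(key_b, dtype=float)
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1.0 - t) * a + t * b
```

Each interpolated pose would then be rendered to a depth image together with its per-pixel part labels.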
Problem:
  It is hard to generate every possible hand pose.
  The dataset is huge: 350,000 images for American Sign Language (ASL)
  letters alone, using a single model.
  It is hard to capture the immense variation in the set. This requires more
  and deeper trees, and therefore extreme amounts of RAM.
Solution:
  Divide the problem into simpler sub-problems and solve them separately.





MULTI LAYERED RDF FOR HAND POSE
The training images are clustered, so a hand shape classifier is needed.
1st layer: train an RDF to retrieve hand shape class posterior likelihoods for an image.
2nd layer: train an expert RDF for each hand shape class.
Combination: use the mixture of the part label posteriors estimated by the experts, with the shape posteriors as weights.
Notation: x: pixel location; I: depth image; Q: number of clusters; q: cluster index
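The mixture described here can be written out explicitly. This is a sketch under two assumptions: c denotes a part label (a symbol the poster does not define), and the weights are the image-level shape posteriors.

```latex
P(c \mid I, x) = \sum_{q=1}^{Q} P(q \mid I)\, P(c \mid q, I, x)
```

Because the shape posteriors sum to one over q, the result is again a valid distribution over part labels for each pixel x.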
UNSUPERVISED CLUSTERING OF IMAGES
Since synthetic images are used for hand pose training, the skeleton configuration
of each image is known.
We use spectral clustering:
  The similarity matrix D is formed from a weighted sum of absolute angle differences.
  Q clusters are selected by applying k-means to the eigenvectors of P.
  The weights are parameterized by a single variable and optimized to maximize accuracy:
    small value: the palm is more important
    large value: the fingertips are more important
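The clustering steps can be sketched as follows. Several details are assumptions rather than the poster's method: the distances are turned into affinities with a Gaussian of width `sigma`, P is taken to be the symmetrically normalized affinity matrix, and a small k-means with farthest-point initialization stands in for whatever implementation was used. The per-joint `weights` are passed in directly instead of being derived from the single tuning variable.

```python
import numpy as np


def pose_distances(angles, weights):
    """D[i, k] = sum_j weights[j] * |angles[i, j] - angles[k, j]|."""
    diff = np.abs(angles[:, None, :] - angles[None, :, :])
    return (diff * weights).sum(axis=2)


def kmeans(X, k, iters=100):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_cluster(angles, weights, Q, sigma=1.0):
    """Cluster poses into Q groups from their joint angles."""
    D = pose_distances(np.asarray(angles, dtype=float), np.asarray(weights, dtype=float))
    A = np.exp(-D ** 2 / (2.0 * sigma ** 2))      # assumed Gaussian affinity
    deg = A.sum(axis=1)
    P = A / np.sqrt(np.outer(deg, deg))           # assumed normalization
    _, vecs = np.linalg.eigh(P)                   # eigenvalues in ascending order
    top = vecs[:, -Q:]                            # Q leading eigenvectors
    top = top / (np.linalg.norm(top, axis=1, keepdims=True) + 1e-12)
    return kmeans(top, Q)
```

With this layout, shifting the weights toward palm joints or toward fingertip joints changes which poses end up in the same cluster, which is the knob the poster tunes for accuracy.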
We found that differences in the higher levels of the skeleton hierarchy (palm,
global rotation) are harder for the RDF to learn, because the features we use
are invariant only to depth.

HAND SHAPE CLASSIFICATION WITH RDF
Hand image classification is equivalent to hand shape classification.
Idea: instead of feeding part labels to the RDF, give a label to each image and
use these labels directly in the RDF.
Classification: each pixel votes for a hand shape class; the votes are averaged over the image.
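The voting step amounts to averaging the per-pixel class posteriors returned by the forest. A minimal sketch, assuming the forest output is already available as an array with one row per hand pixel; `classify_hand_shape` is a hypothetical name.

```python
import numpy as np


def classify_hand_shape(pixel_posteriors):
    """Average per-pixel shape posteriors and pick the most likely class.

    pixel_posteriors: (N, Q) array, one row of class probabilities per pixel
    Returns (class index, averaged posterior over the image).
    """
    posteriors = np.asarray(pixel_posteriors, dtype=float)
    mean_posterior = posteriors.mean(axis=0)
    return int(mean_posterior.argmax()), mean_posterior
```

Averaging keeps the full posterior for the image, which a second layer can then use as mixture weights instead of a hard decision.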

THE EXPERT NETWORKS: GEN vs. LEN
There are two possibilities for connecting the layers: local decision vs. global consensus.
Local Expert Network (LEN): better generalization, entirely parallel.
Global Expert Network (GEN): more robust to noise, less training.
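One way to read the two options, offered as an interpretation rather than the authors' exact formulation: LEN weights the experts at each pixel with that pixel's own shape posterior, while GEN first averages the shape posteriors over the image and applies the same weights to every pixel. The function names and array layouts are assumptions.

```python
import numpy as np


def combine_len(shape_post, part_post):
    """Local decision: each pixel mixes the experts with its own shape posterior.

    shape_post: (N, Q) per-pixel shape posteriors from the first layer
    part_post:  (Q, N, C) part posteriors from each of the Q expert forests
    Returns an (N, C) array of part posteriors.
    """
    return np.einsum("nq,qnc->nc", shape_post, part_post)


def combine_gen(shape_post, part_post):
    """Global consensus: all pixels share the image-level average shape posterior."""
    image_weights = shape_post.mean(axis=0)  # shape (Q,)
    return np.einsum("q,qnc->nc", image_weights, part_post)
```

Under this reading, GEN smooths out noisy per-pixel shape votes, while LEN lets different regions of the same image rely on different experts.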
EXPERIMENTS ON HAND SHAPE
Tested on a large dataset collected with Kinect (65,000+ images from 5 users):
Pugeault et al., "Spelling It Out: Real Time ASL Finger Spelling Recognition"

Forest               Leave-one-subject-out   5x2 cross-validation
4 trees, depth 20    84.3%                   99.1%

[Figure: Source of Confusion]
Extremely fast: 400 fps on the CPU, 4000 fps on the GPU
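The leave-one-subject-out protocol reported for the hand shape experiments holds out all images of one user per fold. A minimal sketch of the split, assuming each image carries a subject identifier; `leave_one_subject_out` is a hypothetical helper, not code from the authors.

```python
import numpy as np


def leave_one_subject_out(subject_ids):
    """Yield (held-out subject, train indices, test indices) for each subject."""
    ids = np.asarray(subject_ids)
    for subject in np.unique(ids):
        test_idx = np.flatnonzero(ids == subject)
        train_idx = np.flatnonzero(ids != subject)
        yield subject, train_idx, test_idx
```

With 5 users this gives 5 folds, each testing on a person the forest never saw during training, which is a harder setting than cross-validation over pooled images.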
[Bar chart, vertical axis 0 to 100 (%): Q = 5: 81.9; Q = 15: 96.2; Q = 25: 98]
EXPERIMENTS ON HAND POSE
Tested on a synthetic dataset of 60,000 images generated from ASL
letters. Per-pixel classification rates are reported below.
1st layer: Hand Shape Classifier RDF, 5 trees of depth 20
  (Initial) accuracy increases with Q.
  We assign each misclassified image to its new label: model-based clustering.

2nd layer: Hand Pose Estimator RDFs,
Q forests with 3 trees each, of depth 20, tested as LEN and GEN
  Huge gains in accuracy
  Twice the time complexity
  40 tests per pixel with much less memory
  Trade-off between memory and accuracy
Per-pixel classification rate (%), from the bar chart (vertical axis 0 to 100):

Method            Q = 5    Q = 15   Q = 25
LEN               75.2     82.6     90.9
GEN               76       83.1     91.2
Single Layered    68
