PROBLEM STATEMENT

Hand skeleton estimation and hand shape recognition are important for human-computer interaction and sign language recognition.
Requirements: the method must be real-time, non-invasive, and robust to illumination changes.

Solution: use a depth sensor such as the Kinect.

Previous work:
- Oikonomidis et al. demonstrated that tracking the interaction of two articulated hands in real time is possible [1] -> ~20 fps, requires tracking.
- Keskin et al. proposed real-time hand skeleton estimation using Randomized Decision Forests (RDF) [2], similar to the body pose estimation method of Shotton et al. [3] -> high memory requirements, limited poses.

Aim: design a method that has higher accuracy and handles more general hand poses.
SINGLE LAYERED RDF FOR HAND POSE

A Randomized Decision Forest (RDF) is used to classify depth pixels into hand parts.

[Figure: decision trees with split tests F; manually posed 20 DOF hand model with 21 hand parts; RDF classification, mean shift, scoring.]

Dataset generation tool: synthesizes realistic depth images and ground truth labels by animating the hand model between manually posed key frames. The model is rotated to ensure view independence.

Problem: it is hard to generate every possible hand pose.
- The dataset size is huge: 350,000 images for American Sign Language (ASL) letters alone, using a single model.
- It is hard to capture the immense variation in the set.
- More and deeper trees are required, so extreme amounts of RAM are needed.

Solution: divide the problem into simpler sub-problems and solve them separately.
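In pipelines of this kind (e.g. Shotton et al. [3]), the per-pixel part posteriors produced by the forest are typically turned into a joint location estimate by a mean shift mode search. The following is only a minimal 2D sketch of weighted mean shift with a Gaussian kernel, not the authors' implementation; the function name, the bandwidth, and the use of posteriors as point weights are assumptions made for illustration.

```python
import math

def mean_shift_mode(points, weights, start, bandwidth=5.0, iters=50, tol=1e-3):
    """Climb to a mode of the weighted point density, starting from `start`.

    points  : list of (x, y) pixel coordinates voting for one hand part
    weights : per-pixel posterior of that part (assumed weighting)
    """
    cx, cy = start
    for _ in range(iters):
        sw = sx = sy = 0.0
        for (px, py), w in zip(points, weights):
            d2 = (px - cx) ** 2 + (py - cy) ** 2
            # Gaussian kernel: nearby pixels pull the estimate more strongly.
            k = w * math.exp(-d2 / (2.0 * bandwidth ** 2))
            sw += k
            sx += k * px
            sy += k * py
        if sw == 0.0:
            break  # no pixel has any influence at the current position
        nx, ny = sx / sw, sy / sw
        shift = math.hypot(nx - cx, ny - cy)
        cx, cy = nx, ny
        if shift < tol:
            break  # converged
    return cx, cy
```

Because distant pixels receive almost zero kernel weight, isolated misclassified pixels have little effect on the estimated joint position.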
MULTI LAYERED RDF FOR HAND POSE

We clustered the images, so we need a hand shape classifier.

Notation: x: pixel location, I: depth image, Q: number of clusters, q: cluster index.

1st layer: train an RDF to retrieve hand shape class posterior likelihoods for an image.
2nd layer: train an expert RDF for each hand shape class.
Combination: use the mixture of the part label posteriors estimated by the experts, with the shape posteriors as weights.

UNSUPERVISED CLUSTERING OF IMAGES

Since synthetic images are used for hand pose training, the skeleton configuration of each image is known.

We use spectral clustering:
- The similarity matrix D is formed from a weighted sum of absolute angle differences.
- Q clusters are selected by applying k-means to the eigenvectors of P.
- The weights are parameterized by a single variable, which is optimized to maximize accuracy. Small values: the palm is more important. Large values: the fingertips are more important.

We found that differences in the higher levels of the skeleton hierarchy (palm, global rotation) are harder for the RDF to learn, because the features we use are invariant only to depth.
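The pairwise matrix D used for clustering can be sketched as follows. This is an illustrative sketch, not the authors' code: the exact weighting scheme is not given here, so the rule `alpha ** level` (level 0 for the palm, higher levels towards the fingertips) is a hypothetical choice that merely reproduces the stated behaviour of the single parameter. The eigen-decomposition and k-means steps are omitted.

```python
def joint_weights(levels, alpha):
    """Hypothetical weighting: a joint at hierarchy level l gets weight alpha**l.

    alpha < 1 emphasises the palm (level 0); alpha > 1 emphasises the fingertips.
    """
    return [alpha ** level for level in levels]

def angle_diff(a, b):
    """Absolute difference between two angles in degrees, wrapped to [0, 180]."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def pose_difference(pose_a, pose_b, weights):
    """Weighted sum of absolute joint angle differences between two skeletons."""
    return sum(w * angle_diff(a, b) for w, a, b in zip(weights, pose_a, pose_b))

def pairwise_matrix(poses, weights):
    """Symmetric matrix D with D[i][j] = pose_difference(poses[i], poses[j])."""
    n = len(poses)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = pose_difference(poses[i], poses[j], weights)
    return D
```

With a small `alpha`, two poses that differ only at a fingertip end up close together in D, while a large `alpha` pushes them apart. That is the knob the single weight parameter is described as controlling.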
HAND SHAPE CLASSIFICATION WITH RDF

Hand image classification is equivalent to hand shape classification.

Idea: instead of feeding part labels to the RDF, give labels to whole images and use them directly in the RDF.

Classification: each pixel votes for a hand shape class, and the votes are averaged over the image.
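The voting step can be illustrated with a short sketch. It assumes the forest has already produced one class posterior vector per hand pixel; the function name and the data layout are hypothetical.

```python
def classify_hand_shape(pixel_posteriors):
    """Average per-pixel class posteriors and pick the most likely shape.

    pixel_posteriors : list of per-pixel lists, one probability per shape class
    Returns (best_class_index, mean_posterior).
    """
    n = len(pixel_posteriors)
    if n == 0:
        raise ValueError("at least one hand pixel is required")
    num_classes = len(pixel_posteriors[0])
    # Every pixel contributes equally to the image-level posterior.
    mean = [sum(p[c] for p in pixel_posteriors) / n for c in range(num_classes)]
    best = max(range(num_classes), key=lambda c: mean[c])
    return best, mean
```

Averaging soft posteriors rather than counting hard votes keeps the confidence of each pixel in the final decision.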
THE EXPERT NETWORKS: GEN vs. LEN

There are two possibilities for connecting the layers: local decision vs. global consensus.

- Local Expert Network (LEN): better generalization, entirely parallel.
- Global Expert Network (GEN): more robust to noise, less training.

EXPERIMENTS ON HAND SHAPE

Tested on a large dataset collected with Kinect (65,000+ images from 5 users): Pugeault et al., "Spelling It Out: Real Time ASL Finger Spelling Recognition".

Classifier: 4 trees of depth 20.

Validation scheme         Accuracy
Leave-one-subject-out     84.3%
5x2 cross-validation      99.1%

[Figure: source of confusion.]

Extremely fast: 400 fps on the CPU, 4000 fps on the GPU.

EXPERIMENTS ON HAND POSE

Tested on a synthetic dataset of 60,000 images, generated from ASL letters. Per-pixel classification rates are reported below.

1st layer: hand shape classifier RDF, 5 trees of depth 20.

Q      Accuracy (%)
5      81.9
15     96.2
25     98.0

(Initial) accuracy increases with Q. We assign each misclassified image to its new label: model-based clustering.
2nd layer: hand pose estimator RDFs, Q forests with 3 trees each, of depth 20, tested as LEN and GEN.

Method            Q      Per-pixel classification rate (%)
Single layered    -      68.0
LEN               5      75.2
LEN               15     82.6
LEN               25     90.9
GEN               5      76.0
GEN               15     83.1
GEN               25     91.2

- Huge gains in accuracy.
- Twice the time complexity (40 tests per pixel), with much less memory.
- Trade-off between memory and accuracy.
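The two ways of connecting the layers, LEN and GEN, can be contrasted in a short sketch. It reflects one reading of "local decision vs. global consensus": LEN weights the experts with each pixel's own shape posterior, while GEN first averages the shape posteriors over the image and applies the same weights to every pixel. The function names and the data layout are hypothetical, and for simplicity the sketch assumes every expert has been evaluated at every pixel.

```python
def mix(expert_posteriors, shape_weights):
    """Mixture of expert part posteriors, weighted by shape posteriors.

    expert_posteriors[q][c] : posterior of part c according to expert q
    shape_weights[q]        : weight of hand shape class q
    """
    num_parts = len(expert_posteriors[0])
    return [
        sum(w * e[c] for w, e in zip(shape_weights, expert_posteriors))
        for c in range(num_parts)
    ]

def len_combine(shape_posteriors, expert_posteriors):
    """LEN: each pixel uses its own (local) shape posterior as weights."""
    return [
        mix(expert_posteriors[x], shape_posteriors[x])
        for x in range(len(shape_posteriors))
    ]

def gen_combine(shape_posteriors, expert_posteriors):
    """GEN: all pixels share the image-level (global) average shape posterior."""
    n = len(shape_posteriors)
    num_shapes = len(shape_posteriors[0])
    global_w = [sum(p[q] for p in shape_posteriors) / n for q in range(num_shapes)]
    return [mix(expert_posteriors[x], global_w) for x in range(n)]
```

In this reading, LEN needs no image-wide reduction before the second layer, which fits the "entirely parallel" remark, while the averaging in GEN smooths out noisy per-pixel shape estimates.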