
An Open Source Framework for Real-time, Incremental, Static and Dynamic Hand Gesture Learning and Recognition

Todd C. Alexander¹, Hassan S. Ahmed², and Georgios C. Anagnostopoulos¹

¹ Electrical and Computer Engineering, Florida Institute of Technology, Melbourne, Florida, USA
talexand@fit.edu, georgio@fit.edu
² Electrical Engineering, University of Miami, Miami, Florida, USA
h.ahmed@umiami.edu

Abstract. Real-time, static and dynamic hand gesture learning and recognition makes it possible for computers to recognize hand gestures naturally. This creates endless possibilities in the way humans can interact with computers, allowing a human hand to be a peripheral by itself. The software framework developed provides a lightweight, robust, and practical application programming interface that helps further research in the area of human-computer interaction. Approaches that have proven successful in analogous areas, such as speech and handwriting recognition, were applied to static and dynamic hand gestures: a semi-supervised Fuzzy ARTMAP neural network is used for incremental online learning and recognition of static gestures, and Hidden Markov Models for online recognition of dynamic gestures. A simple anticipatory method was implemented for determining when to update key frames, allowing the framework to work with dynamic backgrounds.

Keywords: Motion detection, hand tracking, real-time gesture recognition, software framework, FAST corner detection, ART neural networks.

1 Introduction

User experience in human-computer interaction can be dramatically improved by allowing users to interact in more natural and intuitive ways. The problem is that there is no single efficient, robust, and inexpensive solution for recognizing static and dynamic hand gestures; consequently, interfaces may be less natural or intuitive. Visualizing and interpreting hand gestures the way humans are able to will provide many new ways to interact with computers. Applications such as sign language recognition [1], [2], touch-less computer interaction [3], [4], and behavior understanding [5] have already demonstrated why bare-hand computer interaction plays an important role in emerging consumer electronics. Solutions used in the past to improve tracking and recognition include gloves or markers [6], or more expensive hardware such as infrared cameras [7]. Gloves and markers require setup and mandatory calibration, resulting in a less natural user experience. Infrared cameras improve robustness with respect to dynamic backgrounds; however, their cost prevents them from being widely adopted. The Gestur framework developed in this research provides a foundation for creating new, and improving existing, applications that depend on hand gesture recognition: it requires minimal hardware and software resources and can operate in a vast range of conditions.

In the rest of the paper, we examine the Gestur framework and its components in Section 2. Finally, Section 3 provides a summary of our work and a discussion of some of Gestur's key characteristics.

2 Gestur

Gestur is an open-source software framework designed and developed for software developers wishing to incorporate static and dynamic hand gesture recognition into their applications. Developers access the framework via its application programming interface, which provides abstractions for capture, calibration, tracking, training, and classification in hand gesture recognition. Unlike previous libraries and toolkits [8], [9], Gestur is primarily focused on real-time hand gesture recognition.

2.1 Software Architecture

Gestur is written in the C# programming language [10] and runs on the Microsoft .NET framework, utilizing the DirectShow API [11]. DirectShow exists as an unmanaged¹ framework; thus, in order to utilize it, an unofficial managed wrapper² called DirectShow.net [12] was employed. The main child namespaces outlined in Fig. 1 are:

Capture: works closely with DirectShow to allow any video capture device that supports the Windows Driver Model to be used as an input device.
Calibration: retrieves parameters that other framework components need to perform their job reliably and efficiently, e.g., the finger tip radius to use in FAST corner detection.
Tracking: searches for and keeps track of objects of interest, namely the hand.
Learning: allows training the framework for subsequent recognition and classification.
Utility: provides many image processing methods to assist other objects in the framework.

The majority of components in Gestur are polymorphic in design, allowing programmers to replace any component with an alternative, as sketched below. The framework is also fully open source, making code distribution, review, and issue tracking convenient for interested developers.

¹ Gestur runs on Microsoft .NET, a managed framework, hence memory management is done automatically; DirectShow is written in native C++, which does not provide garbage collection.
² DirectShow.net facilitates the interoperability between the managed and unmanaged domains.
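To make the polymorphic design concrete, the following is a minimal C# sketch of how pluggable components might be expressed. The interface and class names (IHandTracker, IStaticGestureClassifier, GestureEngine) are illustrative assumptions, not Gestur's actual API.

```csharp
using System.Drawing;

// Hypothetical interfaces illustrating the pluggable, polymorphic design
// described above; the names are illustrative, not Gestur's actual API.
public interface IHandTracker
{
    // Locate the hand in the current frame, or return null if none is found.
    Rectangle? Locate(Bitmap currentFrame);
}

public interface IStaticGestureClassifier
{
    // Map an extracted feature vector to a gesture label (-1 for "unknown").
    int Classify(double[] features);

    // Incrementally learn one labeled example (online training).
    void Learn(double[] features, int label);
}

// A client depends only on the interfaces, so the default ssFAM classifier
// could be swapped for any alternative implementation.
public sealed class GestureEngine
{
    private readonly IHandTracker tracker;
    private readonly IStaticGestureClassifier classifier;

    public GestureEngine(IHandTracker tracker, IStaticGestureClassifier classifier)
    {
        this.tracker = tracker;
        this.classifier = classifier;
    }
}
```

An application could then instantiate GestureEngine with the default components or with drop-in replacements, which is what enables the machine-learning experimentation discussed in Section 3.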

[Fig. 1 diagram: the Gestur namespace contains Capture, Calibration, Tracking, Utility, and Learning; Learning contains Training and Classification.]

Fig. 1. Software architecture of the Gestur framework. Main components interact with each other as indicated by the arrows. Utility provides general purpose image processing methods.

2.2 Hand Detection

Human beings are constantly in motion, and some degree of motion is expected when a person is about to perform a static or dynamic gesture. Using a two-frame difference, as illustrated in Fig. 2, the areas containing motion can be easily identified. The result is a binary image containing potential areas where the hand may be located, called candidate regions of interest.
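As an illustration, the following minimal C# sketch computes a two-frame difference on 8-bit grayscale frames stored as flat byte arrays. The threshold parameter is an assumption here; in Gestur such parameters would come from calibration.

```csharp
// Minimal sketch of the two-frame difference on 8-bit grayscale frames.
public static class FrameDiff
{
    public static bool[] TwoFrameDifference(byte[] previous, byte[] current, byte threshold)
    {
        var motion = new bool[current.Length];
        for (int i = 0; i < current.Length; i++)
        {
            // A pixel is marked as "motion" when its intensity change
            // between the two frames exceeds the threshold.
            motion[i] = System.Math.Abs(current[i] - previous[i]) > threshold;
        }
        return motion; // binary image marking candidate regions of interest
    }
}
```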

Fig. 2. Example of a previous frame, a current frame, and their two-frame difference in A, B, and C, respectively. The two-frame difference identifies regions in which motion occurred.

The amount of motion being experienced also dictates when the key/reference frame should be updated. This approach allows the framework to anticipate when to expect a hand gesture. The setting of this reference frame is very important for obtaining an outline of the hand, as in Fig. 3. Every candidate region presented by the two-frame difference is investigated as a number of sub-images³, as shown in Fig. 3. The Features from Accelerated Segment Test (FAST) corner detection algorithm [13] is then performed using the suggested finger tip radius [14], as well as the object's size in relation to the sub-image. Using geometric constraints of a typical hand [15], feedback is provided on whether or not the sub-image contains a hand.

³ The dimensions used for each sub-image are determined during calibration.

Fig. 3. Candidate region of interest containing a hand. The segmented hand is found by computing the difference between the current frame and the key/reference frame. The key frame is set when a hand gesture is anticipated, to avoid the hand being part of the frame.
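A hedged sketch of the anticipatory key-frame logic described above: refresh the key frame while the scene is quiet, so that it absorbs background changes, and freeze it as soon as motion suggests an incoming gesture. The specific thresholds and quiet-frame count are illustrative assumptions, not Gestur's calibrated values.

```csharp
// Sketch of anticipatory key-frame updating: while the scene is quiet the
// key/reference frame keeps refreshing (absorbing background changes); once
// motion rises above a threshold, a gesture is anticipated and the key frame
// is frozen so the hand never becomes part of it.
public sealed class KeyFrameManager
{
    private readonly double motionThreshold;   // fraction of pixels in motion (assumed value)
    private readonly int quietFramesRequired;  // consecutive quiet frames before a refresh
    private int quietFrames;

    public byte[] KeyFrame { get; private set; }

    public KeyFrameManager(double motionThreshold = 0.02, int quietFramesRequired = 10)
    {
        this.motionThreshold = motionThreshold;
        this.quietFramesRequired = quietFramesRequired;
    }

    public void Update(byte[] currentFrame, bool[] motionMask)
    {
        int moving = 0;
        foreach (bool m in motionMask) if (m) moving++;
        double motionFraction = (double)moving / motionMask.Length;

        if (motionFraction < motionThreshold)
        {
            // Scene is quiet: count toward refreshing the key frame.
            if (++quietFrames >= quietFramesRequired)
            {
                KeyFrame = (byte[])currentFrame.Clone();
                quietFrames = 0;
            }
        }
        else
        {
            // Motion detected: a gesture is anticipated, so freeze the key frame.
            quietFrames = 0;
        }
    }
}
```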

2.3 Static Gestures

Once the hand has been detected and tracked, a variety of elementary, easy-to-compute geometric features are extracted to determine whether a gesture is being performed (gesture detection) and, finally, to recognize the type of gesture. The rotation-, translation-, scale-, and reflection-invariance of these features was mandated to ensure robust and computationally efficient recognition. The features used with the default classifier are the number of fingers and the angle formed between the centroid of the finger tips and the forearm, illustrated as A in Fig. 4.

Static gestures are learned and, subsequently, recognized by a semi-supervised Fuzzy ARTMAP (ssFAM) neural network [16-18]. ssFAM is a neural network architecture built upon Grossberg's Adaptive Resonance Theory (ART). It has the ability to learn associations between a set of input patterns and their respective classes/labels, which can then be utilized for classification purposes. The ssFAM network clusters training patterns into categories, whose geometrical representations are hyper-rectangles embedded in the feature space. Each time a test pattern is presented, a similarity measure is computed to determine which of the discovered categories it belongs to. Via this mechanism, the network can also discover whether completely new information is being presented; this is the case when a given test pattern does not fit the characteristics of any of the categories the system has learned so far. This mechanism is useful in our application, so that irrelevant or ambiguous gestures are appropriately ignored.

Fig. 4. Features used for classifying static gestures: number of finger tips, and the angle, A, between the centroid of the finger tips and the forearm. These features are rotation-, translation-, scale-, and reflection-invariant.
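To ground the feature extraction, here is a small C# sketch computing the two default features: the finger count and the angle A between a forearm reference point and the finger-tip centroid. The use of Atan2 and the lack of any normalization are assumptions for illustration; the paper does not specify these details.

```csharp
using System;
using System.Drawing;

// Sketch of extracting the default static-gesture features from Fig. 4.
public static class StaticFeatures
{
    public static double[] Extract(PointF[] fingerTips, PointF forearm)
    {
        // Centroid of the detected finger tip positions.
        float cx = 0, cy = 0;
        foreach (PointF p in fingerTips) { cx += p.X; cy += p.Y; }
        cx /= fingerTips.Length;
        cy /= fingerTips.Length;

        // Angle A of the line from the forearm point to the finger-tip centroid.
        double angle = Math.Atan2(cy - forearm.Y, cx - forearm.X);

        // Feature vector: number of fingers and the angle A.
        return new double[] { fingerTips.Length, angle };
    }
}
```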

An additional, important quality of ssFAM (and of ART-based classifiers) is the fact that the network is able to learn incrementally and, therefore, is ideal for online learning tasks. In our gesture recognition framework this is a highly desirable feature, as it allows our system to add new gestures or to accumulate additional knowledge about already learned gestures in an online fashion. Finally, the fact that ssFAM is capable of organizing its acquired knowledge by forming categories in a semi-supervised manner allows the system to generalize well when attempting to learn gestures.
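For intuition, the following C# sketch shows the Fuzzy ART mechanics that ssFAM builds on: complement coding, a choice function that ranks categories, and a vigilance test that both gates learning and flags novel patterns as unknown. It is a bare illustration under simplifying assumptions (no match tracking, no semi-supervised label tolerance), not Gestur's ssFAM implementation [16-18].

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Bare-bones Fuzzy ART competition illustrating the mechanisms described above.
public sealed class FuzzyArtSketch
{
    private readonly List<double[]> weights = new List<double[]>(); // category hyper-rectangles
    private readonly List<int> labels = new List<int>();            // class label per category
    private readonly double vigilance;   // rho in [0,1]: similarity required to match
    private readonly double choiceAlpha; // small positive constant in the choice function
    private readonly double learnRate;   // beta in (0,1]: 1.0 means fast learning

    public FuzzyArtSketch(double vigilance = 0.75, double choiceAlpha = 0.001, double learnRate = 1.0)
    {
        this.vigilance = vigilance;
        this.choiceAlpha = choiceAlpha;
        this.learnRate = learnRate;
    }

    // Complement coding: x in [0,1]^d becomes (x, 1-x), so categories are hyper-rectangles.
    private static double[] ComplementCode(double[] x) => x.Concat(x.Select(v => 1.0 - v)).ToArray();
    private static double[] FuzzyAnd(double[] a, double[] b) => a.Zip(b, Math.Min).ToArray();
    private static double Norm(double[] a) => a.Sum();

    // Winner-take-all search; returns -1 when no category passes the vigilance test.
    private int FindWinner(double[] coded)
    {
        int best = -1;
        double bestChoice = double.NegativeInfinity;
        for (int j = 0; j < weights.Count; j++)
        {
            double[] and = FuzzyAnd(coded, weights[j]);
            double choice = Norm(and) / (choiceAlpha + Norm(weights[j]));
            if (Norm(and) / Norm(coded) >= vigilance && choice > bestChoice)
            {
                bestChoice = choice;
                best = j;
            }
        }
        return best;
    }

    // Returns the gesture label, or -1 for a novel pattern (the gesture is ignored).
    public int Classify(double[] input)
    {
        int winner = FindWinner(ComplementCode(input));
        return winner >= 0 ? labels[winner] : -1;
    }

    // Incremental, online learning: grow the winner toward the input, or create a category.
    public void Learn(double[] input, int label)
    {
        double[] coded = ComplementCode(input);
        int winner = FindWinner(coded);
        if (winner < 0)
        {
            weights.Add(coded); // novel pattern: new category, learned without retraining
            labels.Add(label);
        }
        else
        {
            double[] and = FuzzyAnd(coded, weights[winner]);
            for (int k = 0; k < coded.Length; k++)
                weights[winner][k] = learnRate * and[k] + (1 - learnRate) * weights[winner][k];
        }
    }
}
```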

2.4 Dynamic Gestures

The movement of the hand conveys important information about the message users are trying to communicate. This information is represented by the moving hand's trajectory, which can be expressed as a stream of primitive movements. Because of the success of Hidden Markov Models (HMMs) in applied recognition of symbol sequences [19], HMMs comprise the default classifier used in Gestur for dynamic hand gesture recognition. After the hand is detected, the center of the palm is computed and tracked. The trajectory is sampled from the beginning of the gesture until its end, and a sequence of motion primitives is established, as shown in Fig. 5. This stream is fed to a bank of HMMs, and the dynamic gesture is labeled according to the model that best fits the observed sequence of primitives.


Fig. 5. Dynamic gesture classification subsystem. A stream of simple motion primitives is forwarded to a bank of HMMs, each of which is specialized to recognize a specific dynamic gesture. The dynamic gesture is labeled according to the model reporting the maximum likelihood of having generated the observed sequence.
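The decision rule can be sketched in C# as follows. The scaled forward algorithm used here to score each model is an illustrative stand-in for the optimized Viterbi decoding that Gestur actually employs (discussed below), and all model parameters are placeholders.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the bank-of-HMMs decision rule: each discrete HMM scores the observed
// stream of motion primitives, and the gesture label comes from the model
// reporting the maximum likelihood.
public sealed class DiscreteHmm
{
    public double[] Initial;      // pi[i]: initial state probabilities
    public double[,] Transition;  // a[i, j]: state transition probabilities
    public double[,] Emission;    // b[j, k]: probability of symbol k in state j

    // Log-likelihood of an observation sequence, with per-step scaling to avoid underflow.
    public double LogLikelihood(int[] observations)
    {
        int n = Initial.Length;
        double[] alpha = new double[n];
        double logLik = 0.0;
        for (int t = 0; t < observations.Length; t++)
        {
            var next = new double[n];
            for (int j = 0; j < n; j++)
            {
                double sum = 0.0;
                if (t == 0) sum = Initial[j];
                else for (int i = 0; i < n; i++) sum += alpha[i] * Transition[i, j];
                next[j] = sum * Emission[j, observations[t]];
            }
            double scale = 0.0;
            foreach (double v in next) scale += v;
            if (scale == 0.0) return double.NegativeInfinity; // sequence impossible under this model
            for (int j = 0; j < n; j++) next[j] /= scale;
            logLik += Math.Log(scale);
            alpha = next;
        }
        return logLik;
    }
}

public static class HmmBank
{
    // Label the stream with the gesture whose model most likely generated it.
    public static string Classify(IDictionary<string, DiscreteHmm> bank, int[] primitives)
    {
        string bestGesture = null;
        double bestLogLik = double.NegativeInfinity;
        foreach (KeyValuePair<string, DiscreteHmm> pair in bank)
        {
            double ll = pair.Value.LogLikelihood(primitives);
            if (ll > bestLogLik) { bestLogLik = ll; bestGesture = pair.Key; }
        }
        return bestGesture;
    }
}
```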

The main challenge with classifying dynamic gestures this way is accurately identifying the beginning and end of the gesture. In Gestur, the beginning is indicated when a hand containing a recognized static gesture is present and has moved some minimum distance. The end of the gesture is marked when the hand comes to a stop.
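A minimal sketch of that segmentation rule as a per-frame state machine; the minimum-distance and rest thresholds are illustrative assumptions, not Gestur's values.

```csharp
using System;
using System.Drawing;

// Sketch of gesture start/end segmentation: a dynamic gesture starts once a
// recognized static gesture has moved a minimum distance, and ends when the
// hand stays (nearly) still for a few frames.
public sealed class GestureSegmenter
{
    private readonly double minStartDistance;
    private readonly double restSpeed;       // per-frame displacement treated as "stopped"
    private readonly int restFramesToEnd;

    private PointF anchor;  // position where the static gesture was first seen
    private PointF last;
    private int restFrames;

    public bool InGesture { get; private set; }

    public GestureSegmenter(double minStartDistance = 20, double restSpeed = 2, int restFramesToEnd = 5)
    {
        this.minStartDistance = minStartDistance;
        this.restSpeed = restSpeed;
        this.restFramesToEnd = restFramesToEnd;
    }

    // Call once per frame; returns true on the frame where the gesture ends.
    public bool Step(PointF palmCenter, bool staticGestureRecognized)
    {
        double Dist(PointF a, PointF b) =>
            Math.Sqrt((a.X - b.X) * (a.X - b.X) + (a.Y - b.Y) * (a.Y - b.Y));

        if (!InGesture)
        {
            if (!staticGestureRecognized)
            {
                anchor = palmCenter; // no recognized pose yet: keep re-anchoring
            }
            else if (Dist(palmCenter, anchor) >= minStartDistance)
            {
                InGesture = true;    // gesture begins: start collecting primitives
                restFrames = 0;
            }
        }
        else
        {
            // Gesture ends after the hand comes to a stop for several frames.
            restFrames = Dist(palmCenter, last) < restSpeed ? restFrames + 1 : 0;
            if (restFrames >= restFramesToEnd)
            {
                InGesture = false;
                last = palmCenter;
                return true;
            }
        }
        last = palmCenter;
        return false;
    }
}
```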

Furthermore, on-line recognition of dynamic hand gestures via HMMs was made possible through the utilization of an optimized version of the Viterbi algorithm [20]. Additionally, movement can only be in one of eight directions (up, down, left, right, or a combination of vertical and horizontal); hence, this architecture remains efficient and facilitates incremental learning [21], since new gestures are often extensions of already learned models. A sketch of the primitive quantization follows.
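To illustrate the eight-symbol primitive alphabet, the following C# sketch quantizes the displacement between consecutive palm positions into one of eight 45° sectors. The sector numbering convention is an assumption; note that in image coordinates the y-axis typically points down, so "up" and "down" may be swapped relative to screen space.

```csharp
using System;
using System.Drawing;

// Sketch of quantizing the palm trajectory into the eight motion primitives
// (up, down, left, right, and the four diagonals).
public static class MotionPrimitives
{
    // Returns a primitive in 0..7: 0 = right, then one step per 45° sector.
    public static int Quantize(PointF from, PointF to)
    {
        double angle = Math.Atan2(to.Y - from.Y, to.X - from.X); // in (-pi, pi]
        // Shift by half a sector so each primitive is centered on its direction.
        int sector = (int)Math.Floor((angle + Math.PI / 8) / (Math.PI / 4));
        return ((sector % 8) + 8) % 8; // wrap into 0..7
    }
}
```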

3 Discussion

Gestur provides an easy way for any software developer to quickly develop event-driven applications involving hand gesture recognition. It provides robust detection and tracking in almost every situation in the home, office, conference room, etc., with ample light present. The framework is also resilient to background changes through anticipatory key frame updating. Gestur utilizes the DirectShow API to allow any image processing operations supported by the hardware to be performed on the hardware, instead of having to implement such operations in software.

As a result, we have been able to successfully use Gestur to produce an application that controls Microsoft PowerPoint presentations, whereby users indicate to the computer the direction to advance slides, terminate a presentation, or any other action initially configured. This application also shows how static and dynamic gestures can replace computer events traditionally triggered through the use of hands, a keyboard, and a mouse. In this application we were able to classify six static and five dynamic gestures, with an overall recognition rate of 76%.

Furthermore, Gestur was designed and developed with the goal of providing a lightweight, practical, and robust framework to further research in the area of hand gesture recognition. The polymorphic design of the framework allows users to experiment with different components, making it possible to study various machine learning techniques as they relate to online static and dynamic hand gesture recognition.

Acknowledgements. The authors would like to thank the National Science Foundation for funding much of the fundamental research through Grant Nos. 0647018 and 0647120. Also, Todd Alexander wishes to acknowledge partial support by the National Science Foundation under Grant No. 0717674. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References
1. Vogler, C., Metaxas, D.: A framework for recognizing the simultaneous aspects of American Sign Language. Computer Vision and Image Understanding 81, 358-384 (2001)
2. Fang, G., Gao, W., Ma, J.: Signer-independent sign language recognition based on SOFM/HMM. In: Proceedings of the IEEE ICCV Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, p. 90 (2001)
3. von Hardenberg, C., Berard, F.: Bare-hand human-computer interaction. In: Proceedings of the 2001 Workshop on Perceptive User Interfaces, pp. 1-8 (2001)
4. Rainer, J. M.: Fast hand gesture recognition for real-time teleconferencing applications. In: International Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-Time Systems, pp. 133-140 (2001)
5. McAllister, G., McKenna, S. J., Ricketts, I. W.: Hand tracking for behaviour understanding. Image and Vision Computing 20, 827-840 (2002)
6. Keskin, C., Erkan, A., Akarun, L.: Real time hand tracking and 3D gesture recognition for interactive interfaces using HMM. In: Proceedings of the Joint International Conference ICANN/ICONIP. Springer (2003)
7. Breuer, P., Eckes, C., Muller, S.: Hand gesture recognition with a novel IR time-of-flight range camera: a pilot study. In: LNCS, vol. 4418, pp. 247-260. Springer, Heidelberg (2007)
8. Lyons, K., Brashear, H., Westeyn, T., Kim, J., Starner, T.: GART: The Gesture and Activity Recognition Toolkit. In: LNCS, vol. 4552, pp. 718-727. Springer, Heidelberg (2007)
9. Westeyn, T., Brashear, H., Atrash, A., Starner, T.: Georgia Tech Gesture Toolkit: supporting experiments in gesture recognition. In: Proceedings of the 5th International Conference on Multimodal Interfaces (ICMI '03), pp. 85-92. ACM, New York (2003)
10. Hejlsberg, A., Wiltamuth, S., Golde, P.: C# Language Specification. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (2003)
11. Microsoft DirectShow, http://msdn.microsoft.com/en-us/library/ms783323(VS.85).aspx
12. DirectShow.net Library, http://directshownet.sourceforge.net/
13. Alkaabi, S., Deravi, F.: Candidate pruning for fast corner detection. Electronics Letters 40, 18-19 (2004)
14. Angel, E., Morrison, D.: Speeding up Bresenham's algorithm. IEEE Computer Graphics and Applications 11, 16-17 (1991)
15. Jerde, T., Soechting, J., Flanders, M.: Biological constraints simplify the recognition of hand shapes. IEEE Transactions on Biomedical Engineering 50, 265-269 (2003)
16. Grossberg, S.: Adaptive pattern classification and universal recoding, II: feedback, expectation, olfaction and illusions. Biological Cybernetics 23, 187-202 (1976)
17. Anagnostopoulos, G., Georgiopoulos, M., Verzi, S., Heileman, G.: Reducing generalization error and category proliferation in ellipsoid ARTMAP via tunable misclassification error tolerance: boosted ellipsoid ARTMAP. In: Proc. IEEE-INNS-ENNS Int'l Joint Conference on Neural Networks (IJCNN '02) 3, 2650-2655 (2002)
18. Anagnostopoulos, G., Bharadwaj, M., Georgiopoulos, M., Verzi, S., Heileman, G.: Exemplar-based pattern recognition via semi-supervised learning. In: Proceedings of the International Joint Conference on Neural Networks 4, 2782-2787 (2003)
19. Makhoul, J., Starner, T., Schwartz, R., Chou, G.: On-line cursive handwriting recognition using hidden Markov models and statistical grammars. In: Proceedings of the Workshop on Human Language Technology, pp. 432-436. Association for Computational Linguistics, Morristown, NJ (1994)
20. Zen, H., Tokuda, K., Kitamura, T.: A Viterbi algorithm for a trajectory model derived from HMM with explicit relationship between static and dynamic features. In: Proceedings of ICASSP 2004, pp. 837-840 (2004)
21. Cavalin, P., Sabourin, R., Suen, C., Britto, A.: Evaluation of incremental learning algorithms for HMM in the recognition of alphanumeric characters. Pattern Recognition (2008)
