

Vision-based Cleaning Area Control for Cleaning Robots
Soowoong Kim, Jae-Young Sim, Member, IEEE, and Seungjoon Yang, Member, IEEE

Abstract: This paper provides a vision-based HCI method for a user to command a cleaning robot to move to a specific location in a home environment. Six hand poses are detected from a video sequence taken by a camera on the cleaning robot. AdaBoost-based hand-pose detectors are trained with a reduced Haar-like feature set to make the detectors robust to the influence of complex backgrounds. The first three stages of the cascade in the six detectors are used for pose estimation to reduce the computational complexity. The cleaning area is determined from the detected pose. The performance of the proposed detectors is validated with a set of test images with cluttered backgrounds. The cleaning area control is simulated with real-world video sequences. The proposed method can effectively control a cleaning robot without the need for a user to wear or employ any input devices.

Index Terms: hand-pose detection, reduced Haar-like feature set, AdaBoost, human-computer interaction, service robots.

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2009-0077022). S. Kim, J. Y. Sim, and S. Yang are with the School of Electrical and Computer Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, Korea (e-mail: {swkim, jysim, syang}@unist.ac.kr).
I. INTRODUCTION
Service robots such as vacuum cleaning robots are finding
their places as household appliances. Maneuvering robots in home or office environments is an active research area [1]-[3].
Automatic maneuvering requires interaction between humans and robots. For example, a user of a cleaning robot can direct the robot to move to a specific location or to perform a specific task. Human-computer interaction (HCI) technologies can be applied to deliver users' commands to robots [4]-[7].
Vision-based HCI technologies are preferred for home
appliances because they do not require users to wear specific
sensors or to use specific input devices. Face and hand
detections are important building blocks for vision-based HCI.
Detection algorithms based on machine learning methods such as support vector machines, neural networks, and adaptive boosting (AdaBoost) have been applied to detect faces and hands [8]-[11]. Among these methods, AdaBoost-based algorithms are adopted in many applications for their good performance and fast detection speed [12].
In this paper, we propose a vision-based HCI method to
control cleaning robots. We assume a cleaning robot is in a home or office environment where rooms are separated by walls. A user points to a specific room with his or her hand,
and the robot understands the user's gesture. The proposed method utilizes the AdaBoost algorithm to detect the user's face and hand. We use six different hand postures to specify six different commands, which are combinations of three directions and an over-the-wall flag. Based on the detected command, the robot determines which room to move to and clean.
Detection of typical objects such as faces, eyes, or license plates is usually not affected by the background: the rectangular or square-shaped window that AdaBoost utilizes in the training and detection phases can contain only the object of interest
without background. However, detection of objects with
irregular shapes such as hand postures can be easily affected
by the background, since background is generally included in
the window AdaBoost utilizes. The performance of a hand
detector can be degraded when it is operated in a fully
dynamic environment with cluttered background [11].
AdaBoost trained with a reduced Haar-like feature set can reduce the adverse effects of background on the performance of the detector [13]. In this work, we adopt the reduced Haar-like feature set so that our six hand detectors can cope with the complex backgrounds of home and office environments. Each hand detector has nine cascade stages in
the strong classifier. We group the first three stages together,
and use them as a pose estimation routine [14]. Once the hand
pose is determined by the pose estimation, the rest of the
stages in the cascade are applied only to the detected pose.
The grouping of the early stages of the detectors into a pose estimation step not only reduces the computational complexity, but also reduces the false alarm rate.
The performances of the six individual hand-pose detectors
and that of the detector with pose estimation are evaluated with
a set of test images containing complex backgrounds. Both outperform the detectors trained with the full Haar-like feature set.
Experiments are performed to determine commands from the
hand postures detected from real-world video sequences. The
proposed method can provide a vision-based command tool for
service robots including vacuum cleaning robots.
This paper is organized as follows. Section II-A presents the
overview of the proposed cleaning area control system.
Section II-B introduces the training method of individual
hand-pose detectors using the reduced Haar-like feature set.
Section II-C describes how the first few stages of the cascades are grouped together to form a pose estimation routine. Section II-D explains how the cleaning area is determined from the detected hand pose. Section III-A
provides the performance evaluation of the individual
detectors. Section III-B provides the performance evaluation
of the detector with pose estimation. Examples of the cleaning
area control with commands extracted from real-world video
sequences are given in Section III-C. Section IV concludes
this paper.
II. VISION-BASED CLEANING AREA CONTROL
A. System Overview
Fig. 1 shows the schematics of the proposed vision-based
cleaning area control system. A cleaning robot is equipped
with a camera facing forward. We assume the robot has a sensor and a control routine so that it can turn toward the user and wait for the user's hand gesture. For example, a cleaning robot can stop its task and turn toward the direction of a sound when it hears a clap or a whistle from the user.
From a video sequence captured by the camera, the robot
detects a face. If no face is detected, it resumes its task. Once
a face is detected, the robot detects one of the six hand
postures. The six hand postures, which represent three azimuth angles and two altitude angles of the user's arm, are shown in Fig. 2. Based on the detected hand posture, we determine the angle in which the user points his or her hand and a flag indicating whether the user is pointing over the wall or not. The cleaning area is selected based on the robot's current location, the angle, and the flag.
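A minimal per-frame sketch of this decision flow (Python) is given below. The detector and robot interfaces are passed in as callables and an object because they are implementation specific; all names here are illustrative and not from the paper.

```python
def cleaning_area_control(frame, detect_face, detect_hand_pose,
                          command_from_pose, select_area, robot):
    """One pass of the control flow in Fig. 1 for a single camera frame."""
    if detect_face(frame) is None:
        robot.resume_task()                  # no face detected: resume cleaning
        return
    pose = detect_hand_pose(frame)           # one of the six postures in Fig. 2
    if pose is None:
        return
    altitude, azimuth = pose
    direction, over_wall = command_from_pose(altitude, azimuth)   # angle and flag
    area = select_area(robot.position, direction, over_wall)      # TABLE I logic
    robot.move_to(area)                      # navigate to the selected cleaning area
```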
B. AdaBoost with Reduced Haar-Like Feature Set

Fig. 1. Schematics of the proposed cleaning area control system.

Fig. 2. Classification of six hand poses based on altitude and azimuth.

Fig. 3. Examples of Haar-like feature templates.

The proposed method detects a face and six hand postures. All
the detection routines are based on AdaBoost [15]. In
AdaBoost, a strong classifier is built based on the cascade of
weak classifiers that use simple features selected from a
feature set through a training phase. Denote the feature set by F_0. The features f_i for i = 1, 2, ..., N in F_0 consist of Haar-like features of various scales and locations. For an image x_i, the feature f_j returns the value

$$f_j(x_i) = \sum_{(m,n) \in B} t_j(m,n)\, x_i(m,n), \qquad (1)$$

where B is the set of indices given by the Cartesian product [1, H] × [1, W], and t_j is the Haar-like feature template. Examples of the feature templates are shown in Fig. 3, where the pixel values are one, minus one, and zero for the white, black, and grey regions, respectively. The sizes of the image and the feature template are H × W. In the detection phase, H × W image patches are extracted from a given image, or from scaled and rotated versions of the given image. The cascade of selected features is calculated and compared to the thresholds to classify whether the image patch is the object of interest or not.
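As a concrete illustration of (1), the following sketch (Python with NumPy, assuming the 32 × 32 window size used later in the paper) evaluates one Haar-like feature as a template-weighted sum over an image patch; in practice such sums are computed efficiently with integral images, as in [12].

```python
import numpy as np

def haar_feature_value(patch, template):
    # f_j(x_i) in (1): sum over (m, n) of t_j(m, n) * x_i(m, n),
    # where the template holds +1 (white), -1 (black) and 0 (grey) entries.
    assert patch.shape == template.shape
    return float(np.sum(template * patch.astype(np.float64)))

# Example: a two-rectangle template on a 32 x 32 window
# (upper half +1, lower half -1), applied to a random stand-in patch.
H, W = 32, 32
template = np.zeros((H, W))
template[:H // 2, :] = 1.0
template[H // 2:, :] = -1.0
patch = np.random.randint(0, 256, size=(H, W))
print(haar_feature_value(patch, template))
```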
For the face detection routine, the images x_i are classified into face or no-face. While face images of size H × W contain only the faces, hand images of size H × W include the background as well as the hands. Examples of face and hand images are shown in Fig. 4. It can be seen that a large number of pixels in a hand image correspond to the background. These background regions can affect the performance of the detectors: if features that are strongly affected by the background are selected in the training phase, the performance of the detectors can be severely degraded when they operate against cluttered backgrounds.
The influence of the background on the detector performance can be alleviated by excluding the features that are easily affected by the background [13]. In the training phases of the hand detection routines, we eliminate such features in advance to reduce the influence of the background.

Fig. 4. Training examples of face and hand. A rectangular window can contain the face region without any background, but hand poses cannot be contained in a rectangular window without including background.

Fig. 5. Average hand images of training examples of the six hand poses and the corresponding mask images.

Fig. 6. Examples of the features selected for the strong classifiers, (a) trained with the reduced feature set, and (b) trained with the full feature set.

Let x̄ be the average of the images x_i for i = 1, 2, ..., N. A mask is obtained by taking the region where the average pixel values are greater than a threshold T_A:

$$\mu(m, n) = \begin{cases} 1, & \text{if } \bar{x}(m, n) > T_A, \\ 0, & \text{otherwise}. \end{cases} \qquad (2)$$

Two thousand hand images of the same size without
background are used for each hand posture to obtain the mask
image. The average hand images and the corresponding mask
images are given in Fig. 5. The overlapping ratio between the
mask and the jth feature template t_j is calculated by

$$\gamma_j = \frac{1}{WH} \sum_{(m,n) \in B} |t_j(m, n)|\, \mu(m, n). \qquad (3)$$

A set of features inside the average hand regions is obtained by


$$F_M = \{\, t_j \mid t_j \in F_0 \ \text{and}\ \gamma_j > T_M \,\}. \qquad (4)$$
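A minimal NumPy sketch of (2)-(4) as printed, assuming the segmented hand images and the candidate Haar-like templates are already available as arrays, is given below; the default threshold values T_A = 30 and T_M = 0.99 are those reported in the training description that follows.

```python
import numpy as np

def reduced_feature_set(hand_images, templates, t_a=30, t_m=0.99):
    """Build the mask from the average hand image, eq. (2), compute the
    overlap ratio of every template with the mask, eq. (3), and keep the
    templates whose ratio exceeds T_M, eq. (4).
    hand_images: iterable of H x W grayscale hand images without background.
    templates:   iterable of H x W arrays with entries in {-1, 0, +1}."""
    x_bar = np.mean(np.stack(list(hand_images)).astype(np.float64), axis=0)
    mask = (x_bar > t_a).astype(np.float64)              # eq. (2)
    h, w = mask.shape
    f_m = []
    for t_j in templates:
        gamma_j = np.sum(np.abs(t_j) * mask) / (w * h)   # eq. (3)
        if gamma_j > t_m:
            f_m.append(t_j)                              # eq. (4)
    return mask, f_m
```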

Example images (x_i, y_i) for i = 1, 2, ..., N are prepared for the training of the face and hand-pose detectors, where y_i is zero for the negative example images and one for the positive example images. The size of the images is 32 × 32. Two thousand positive images including various backgrounds and eight thousand negative images are prepared for each detector. The training of the detectors follows the algorithm of [12], except that the set F_M is used in the training phase instead of the original feature set F_0. The thresholds T_A and T_M are 30 and 0.99, respectively. The number of cascade stages in each detector is set to nine.
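For reference, a compact sketch of one boosting round restricted to the reduced set F_M is given below; it selects the decision-stump weak classifier with the lowest weighted error and re-weights the examples, in the spirit of the training procedure of [12]. The candidate-threshold search over a few percentiles is a simplification (Viola and Jones search over all sorted feature values), and the structure of the inputs is an assumption for illustration.

```python
import numpy as np

def boosting_round(images, labels, weights, reduced_set):
    """One round of AdaBoost over decision stumps built on the reduced
    feature set F_M.  images: N x H x W array, labels: {0, 1} array of
    length N, weights: nonnegative array of length N summing to one."""
    best = None
    imgs = images.astype(np.float64)
    for t_j in reduced_set:
        values = np.tensordot(imgs, t_j, axes=([1, 2], [0, 1]))  # f_j(x_i) for all i
        for theta in np.percentile(values, [10, 25, 50, 75, 90]):
            for p in (+1, -1):                                   # stump polarity
                preds = (p * values < p * theta).astype(int)
                err = float(np.sum(weights * (preds != labels)))
                if best is None or err < best[0]:
                    best = (err, t_j, theta, p)
    err, t_j, theta, p = best
    beta = err / max(1.0 - err, 1e-12)
    values = np.tensordot(imgs, t_j, axes=([1, 2], [0, 1]))
    preds = (p * values < p * theta).astype(int)
    # correctly classified examples get their weights multiplied by beta < 1
    new_w = weights * np.where(preds == labels, beta, 1.0)
    return (t_j, theta, p, np.log(1.0 / max(beta, 1e-12))), new_w / new_w.sum()
```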
Fig. 6 shows examples of the features selected for the strong classifiers. It can be observed that some of the features selected when training with the original feature set F_0 are located at the boundaries between the hand and background regions. These features will produce significantly different outputs as the background changes. In contrast, all of the features selected when training with the reduced feature set F_M lie inside the hand region, so the pixels in the background do not affect the outcome of the classifiers.

C. Hand Detectors with Pose Estimation
In the proposed method, we employ six hand-pose detectors. In the detection phase, patches of images are exhaustively classified as hand or no-hand, and the computational complexity of six independently run hand-pose detectors is six times that of a single detector. A method to reduce the computational complexity is to utilize the confidence level [14]. Let k = 1, 2, ..., 6 be the index of the six hand-pose detectors, and l = 1, 2, ..., 9 be the index of the nine cascade stages. The confidence level of the lth cascade stage of the kth detector is defined by
$$C_l^k(x_i) = \sum_j \left| f_j(x_i) - \theta_j \right|,$$

where the sum is over all the selected features f_j (with thresholds θ_j) in the lth stage of the kth detector. The confidence level of the kth detector after the first L cascade stages is obtained by

$$C_{1:L}^k(x_i) = \sum_{l=1}^{L} C_l^k(x_i).$$

The hand pose is estimated by finding the pose with the maximum confidence level,

$$\hat{k} = \arg\max_k C_{1:L}^k(x_i).$$

Once the hand pose is estimated, only the k̂th detector finishes the stages from the (L+1)th to the 9th to classify the input image x_i as hand or no-hand. The schematic diagram of the hand detector with pose estimation is given in Fig. 7.
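A sketch of this pose estimation step is given below (Python/NumPy). The stage structure assumed here, a list of (template, threshold) pairs per stage, is illustrative; only the confidence computation and the arg-max selection follow the equations above.

```python
import numpy as np

def estimate_pose(patch, detectors, first_stages=3):
    """Evaluate the first L = 3 cascade stages of all six hand-pose detectors
    on an image patch and return the index k_hat of the pose with the largest
    accumulated confidence C_{1:L}^k.  Each detector is assumed to be a list
    of stages, and each stage a list of (template, theta) pairs."""
    def stage_confidence(stage):
        # C_l^k(x) = sum_j |f_j(x) - theta_j| over the stage's selected features
        return sum(abs(float(np.sum(t * patch)) - theta) for t, theta in stage)

    conf = [sum(stage_confidence(s) for s in det[:first_stages]) for det in detectors]
    k_hat = int(np.argmax(conf))
    # Only detector k_hat then runs its remaining stages (L+1 .. 9) in the
    # usual cascade fashion to accept or reject the patch as a hand.
    return k_hat, conf
```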


D. Cleaning Area Control

The six hand poses in the proposed method represent particular pointing directions. Fig. 2 shows how the azimuth and altitude angles of the user's arm are related to the six hand poses. The three azimuth angles represent the forward, sideways, and backward directions. The two altitude angles determine whether the user is indicating a room over the wall or a room in front of the wall. Once the hand pose is detected, the azimuth and altitude angles are converted to a pointing direction and an over-the-wall flag. Based on the current locations of the robot and the user, the cleaning area is determined from the pointing direction and the flag. TABLE I summarizes the cleaning area associated with each direction and flag; a sketch of this mapping follows the table.

TABLE I
SUMMARY OF CLEANING AREAS DETERMINED FROM THE DETECTED HAND POSTURE

  altitude   azimuth   direction   flag   cleaning area
  0°         45°       45°         1      robot's left, over the wall
  45°        45°       45°         0      robot's left, in front of the wall
  0°         90°       90°         1      user's right, over the wall
  45°        90°       90°         0      user's right, in front of the wall
  0°         135°      135°        1      user's right behind, over the wall
  45°        135°      135°        0      user's right behind, in front of the wall
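The mapping summarized in TABLE I can be written down directly; the sketch below (Python) encodes the six (altitude, azimuth) poses as keys and returns the pointing direction and the over-the-wall flag. The floor-map logic that turns the direction and flag into an actual room is robot specific and only indicated by a hypothetical helper.

```python
# (altitude, azimuth) of the detected pose -> (pointing direction, over-the-wall flag),
# following TABLE I.
POSE_TO_COMMAND = {
    (0, 45):   (45, True),    # robot's left, over the wall
    (45, 45):  (45, False),   # robot's left, in front of the wall
    (0, 90):   (90, True),    # user's right, over the wall
    (45, 90):  (90, False),   # user's right, in front of the wall
    (0, 135):  (135, True),   # user's right behind, over the wall
    (45, 135): (135, False),  # user's right behind, in front of the wall
}

def command_from_pose(altitude, azimuth):
    """Return (direction in degrees, over-the-wall flag) for a detected pose."""
    return POSE_TO_COMMAND[(altitude, azimuth)]

# The cleaning area itself would then be looked up on the floor map from the
# robot's and the user's positions, e.g. select_area(robot_pos, user_pos,
# *command_from_pose(0, 45)), where select_area is a hypothetical,
# robot-specific helper.
```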

III. EXPERIMENTS
A. Performance of Individual Detectors

Fig. 8. ROC curves of the detectors evaluated on the cluttered-background test set. Solid lines: detectors trained with the reduced feature set; dashed lines: detectors trained with the original feature set. (a) altitude 0°, azimuth 45°; (b) altitude 45°, azimuth 45°; (c) altitude 0°, azimuth 90°; (d) altitude 45°, azimuth 90°; (e) altitude 0°, azimuth 135°; (f) altitude 45°, azimuth 135°.


TABLE II
MISS, FALSE ALARM, AND HIT RATES OF DETECTORS TRAINED WITH THE REDUCED AND THE ORIGINAL FEATURE SETS ON THE 1000-IMAGE CLUTTERED-BACKGROUND TEST SETS

  hand pose              reduced feature set        original feature set
  altitude  azimuth    miss   false   hit rate    miss   false   hit rate
  0°        45°          19    1608    98.10%       37    3638    96.30%
  45°       45°           4    1657    99.60%        7    2090    99.30%
  0°        90°          37    1492    96.30%       58    3675    94.20%
  45°       90°           7    1952    99.30%       27    3356    97.30%
  0°        135°         43    1902    95.70%       55    4627    94.50%
  45°       135°         47    2008    95.30%       72    2937    92.80%


Fig. 7. Hand detector with pose estimation. The first three stages are used to estimate the hand pose, and the last six stages are used to verify whether the sub-image is a hand or not.

In order to evaluate the performance of the six hand-pose detectors, a set of one thousand test images for each hand pose is prepared. The test images consist of randomly scaled and rotated hand images overlaid on cluttered background images. The scaling ratio is between 1.0 and 4.0 and the rotation angle is between -5 and 5 degrees. Fig. 8 shows the receiver operating characteristic (ROC) curves of the proposed detectors, trained with the reduced feature set F_M, for each hand pose tested on the set of test images. For comparison, the ROC curves of the original detectors, trained with the original feature set F_0, are also shown. It can be seen that all the proposed hand detectors outperform the original hand detectors. For detection rates greater than 0.9, the false alarm ratios of the proposed detectors are smaller than those of the original detectors. The miss, false positive, and hit rates of the detectors tested on the test set are summarized in TABLE II. The proposed detectors trained with the reduced feature set provide higher hit rates than the original detectors for all the hand postures.
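The paper does not spell out the compositing details, so the following is only a plausible sketch (Python with OpenCV and NumPy, both assumed available) of how such a test image could be generated: a segmented hand image is randomly scaled by a factor in [1.0, 4.0], rotated by an angle in [-5°, 5°], and pasted onto a cluttered background using a binary hand mask, which is assumed to be given.

```python
import numpy as np
import cv2  # OpenCV, assumed available for the geometric transforms

def make_test_image(hand_img, hand_mask, background, rng):
    """Overlay a randomly scaled and rotated hand image on a cluttered background.
    rng is a numpy.random.Generator; hand_mask is a uint8 mask of the hand pixels."""
    scale = rng.uniform(1.0, 4.0)          # scaling ratio between 1.0 and 4.0
    angle = rng.uniform(-5.0, 5.0)         # rotation angle between -5 and 5 degrees
    bh, bw = background.shape[:2]
    h, w = hand_img.shape[:2]
    # rotate/scale about the hand centre, then translate to a random position
    tx = rng.uniform(0, max(bw - scale * w, 1))
    ty = rng.uniform(0, max(bh - scale * h, 1))
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, scale)
    m[:, 2] += (tx + scale * w / 2.0 - w / 2.0, ty + scale * h / 2.0 - h / 2.0)
    warped = cv2.warpAffine(hand_img, m, (bw, bh))
    warped_mask = cv2.warpAffine(hand_mask, m, (bw, bh)) > 0
    out = background.copy()
    out[warped_mask] = warped[warped_mask]  # paste hand pixels onto the background
    return out
```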

B. Performance of Hand Detector with Pose Estimation
Two hundred of the thousand test images for each of the six hand poses are selected and combined into a single test set. The performance of the hand detector with pose estimation is evaluated on this test set. The miss, false positive, and hit rates are summarized in TABLE III. For comparison, the six hand-pose detectors are also executed separately in parallel, and their miss, false positive, and hit rates are provided; the cases where the detectors are trained with the original feature set are given as well. The hand detector with pose estimation provides a higher hit rate than the individually run detectors: the hit rate improves from 96.67% to 96.75% for the detectors trained with the reduced feature set, and from 93.67% to 94.58% for the detectors trained with the original feature set. The detectors trained with the reduced feature set also outperform those trained with the original feature set, with hit rates of 96.75% versus 94.58% with pose estimation, and 96.67% versus 93.67% without pose estimation.
We can conclude that (i) the use of the reduced feature set in the training phase makes the detectors robust to the influence of cluttered backgrounds, and (ii) the use of the first three stages of the cascade for pose estimation improves the performance of the detector. Note that the computational complexity of the detector with pose estimation is considerably smaller, because the last six stages of the cascade are performed only for the one estimated pose: in the worst case this is 6 × 3 + 6 = 24 stage evaluations per window instead of the 6 × 9 = 54 required when the six nine-stage detectors run independently.

TABLE III
MISS, FALSE ALARM, AND HIT RATES OF DETECTORS WITH AND WITHOUT POSE ESTIMATION ON A CLUTTERED-BACKGROUND TEST SET

                      with pose estimation       without pose estimation
  feature set       miss   false   hit rate     miss   false   hit rate
  reduced             39    1821    96.75%        40    2528    96.67%
  original            65    3245    94.58%        76    5264    93.67%

C. Cleaning Area Control
Fig. 9 shows examples of real-world video sequences and the determined cleaning areas. One frame of each video sequence is shown with the detected face and hand indicated by colored circles. The location of the user is shown on a floor map with a red rectangle. From the detected hand pose, the cleaning area the user points to is determined, and the robot moves to that area. For example, Fig. 9(a) shows the case where the robot moves to a room located on its left-hand side, while Fig. 9(b) shows the case where the robot moves to its left-hand side but does not enter the room. The determined cleaning areas and the robot's navigation paths are marked on the floor maps.


Fig. 9. Test of the proposed cleaning area control system in real situations. (a) altitude 0°, azimuth 45°; (b) altitude 45°, azimuth 45°; (c) altitude 0°, azimuth 90°; (d) altitude 45°, azimuth 135°.

IV. CONCLUSION
This paper provides a vision-based HCI method for a cleaning robot to navigate to a specific location in a home environment. The proposed method is based on six AdaBoost-based hand-pose detectors. The detectors are trained with reduced Haar-like feature sets, and the first three stages of the cascade are used for pose estimation. Experiments with test image sets show that the use of the reduced feature set and the pose estimation improves the performance of the detectors in cluttered-background environments. The cleaning area control is simulated with real-world video sequences. The proposed method can effectively control a cleaning robot without the need for a user to wear or employ any input devices.
REFERENCES
[1] C. Chunlin, L. Han-Xiong, and D. Daoyi, "Hybrid control for robot navigation - a hierarchical Q-learning algorithm," IEEE Robotics and Automation Magazine, vol. 15, no. 2, pp. 37-47, 2008.
[2] S. Saeedi, L. Paull, M. Trentini, and H. Li, "Neural network-based multiple robot simultaneous localization and mapping," IEEE Trans. Neural Networks, vol. 22, no. 12, pp. 2376-2387, 2011.
[3] Y. Xue and T. Xu, "An optimal and safe path planning for mobile robot in home environment," in Advanced Research on Computer Science and Information Engineering, Communications in Computer and Information Science, G. Shen and X. Huang, Eds., pp. 442-447, Springer Berlin Heidelberg, 2011.
[4] Y. Chai, S. Shin, K. Chang, and T. Kim, "Real-time user interface using particle filter with integral histogram," IEEE Trans. Consumer Electronics, vol. 56, no. 2, pp. 510-515, 2010.
[5] D. Lee and Y. Park, "Vision-based remote control system by motion detection and open finger counting," IEEE Trans. Consumer Electronics, vol. 55, no. 4, pp. 2308-2313, 2009.
[6] S. Y. Cheng and M. M. Trivedi, "Vision-based infotainment user determination by hand recognition for driver assistance," IEEE Trans. Intelligent Transportation Systems, vol. 11, no. 3, pp. 759-764, 2010.
[7] Y. V. Parkale, "Gesture based operating system control," 2nd Intl. Conf. Advanced Computing & Communication Technologies, pp. 318-323, January 2012.
[8] N. H. Dardas and N. D. Georganas, "Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques," IEEE Trans. Instrumentation and Measurement, vol. 60, no. 11, pp. 3592-3607, 2011.
[9] C. Garcia and M. Delakis, "Convolutional face finder: a neural architecture for fast and robust face detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1408-1423, 2004.
[10] X. Shipeng and P. Jing, "Hand detection using robust color correction and Gaussian mixture model," 6th Intl. Conf. Image and Graphics, pp. 553-557, August 2011.
[11] J. Guo, Y. Liu, C. Chang, and H. Nguyen, "Improved hand tracking system," IEEE Trans. Circuits and Systems for Video Technology, vol. PP, no. 99, pp. 1-1, 2011.
[12] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proc. 2001 IEEE Computer Society Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 511-518, 2001.
[13] S. Kim, J. Y. Sim, and S. Yang, "Background robust hand detection using a reduced Haar-like feature set," submitted for publication in Electronics Letters.
[14] W. Bo, A. Haizhou, H. Chang, and L. Shihong, "Fast rotation invariant multi-view face detection based on real AdaBoost," 6th IEEE Intl. Conf. Automatic Face and Gesture Recognition, pp. 79-84, 2004.
[15] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297-336, 1999.


BIOGRAPHIES
Soowoong Kim received the B.S. degree from Kumoh
National Institute of Technology, Gumi, Korea, in 2009.
He is currently a PhD candidate in the School of
Electrical and Computer Engineering at the Ulsan
National Institute of Science and Technology in Ulsan,
Korea. His research interests are in human-computer
interface and computer vision.



Jae-Young Sim (S'01-M'05) received the B.S. degree in
electrical engineering and the M.S. and Ph.D. degrees in
electrical engineering and computer science from Seoul
National University, Seoul, Korea, in 1999, 2001, and
2005, respectively. From 2005 to 2009, he was a
Research Staff Member, Samsung Advanced Institute of
Technology, Samsung Electronics Co., Ltd. In 2009, he
joined the School of Electrical and Computer
Engineering, Ulsan National Institute of Science and Technology (UNIST) as
an Assistant Professor. His research interests are in image and 3-D visual
signal processing, multimedia data compression, and computer vision.


Seungjoon Yang (S'09-M'00) received the B.S. degree
from Seoul National University, Seoul, Korea, in 1990,
and the M.S. and Ph.D. degrees from the University of
Wisconsin-Madison, in 1993 and 2000, respectively, all
in electrical engineering. He was with the Digital Media
R&D Center at Samsung Electronics Co., Ltd. from
September 2000 to August 2008. He is currently with the
School of Electrical and Computer Engineering at the
Ulsan National Institute of Science and Technology in
Ulsan, Korea. His research interests are in image processing, estimation
theory, and multi-rate systems.
Professor Yang received the Samsung Award for the Best Technology
Achievement of the Year in 2008 for his work in the premium digital
television platform project.
