$$
f_j(x_i) = \sum_{(m,n) \in B} \phi_j(m,n)\, x_i(m,n), \tag{1}
$$
where $B$ is the set of indices given by the Cartesian
product $[1, E] \times [1, w]$, and $\phi_j$ is the Haar-like
feature template. Examples of the feature template are shown in
Fig. 3, where the pixel values are one, minus one, and
zero for the white, black, and grey regions, respectively.
The sizes of the image and the feature template are both
$E \times w$. In the detection phase, image patches of size
$E \times w$ are extracted from a given image, or from scaled
and rotated versions of it. The cascade of selected features is
calculated and compared with the thresholds to classify
whether the image patch is the object of interest or not.
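The feature computation in Eq. (1) can be illustrated with a minimal sketch. It assumes the template and the image patch are given as equally sized nested lists; the helper names `haar_feature` and `two_rectangle_template` and the toy pixel values are hypothetical and are not taken from the paper.

```python
def haar_feature(template, patch):
    """Sum of template[m][n] * patch[m][n] over all indices, as in Eq. (1)."""
    if len(template) != len(patch) or any(
        len(t_row) != len(p_row) for t_row, p_row in zip(template, patch)
    ):
        raise ValueError("template and patch must have the same size")
    return sum(
        t * p
        for t_row, p_row in zip(template, patch)
        for t, p in zip(t_row, p_row)
    )


def two_rectangle_template(height, width):
    """Hypothetical edge template: left half white (+1), right half black (-1)."""
    half = width // 2
    return [[1] * half + [-1] * (width - half) for _ in range(height)]


# Toy 4 x 4 patch (assumed values): bright on the left, dark on the right.
patch = [[200, 200, 50, 50] for _ in range(4)]
template = two_rectangle_template(4, 4)

# Each row contributes 200 + 200 - 50 - 50 = 300, and there are four rows.
response = haar_feature(template, patch)

# A flat patch gives a zero response for this balanced template.
flat_response = haar_feature(template, [[128] * 4 for _ in range(4)])
```

In a detector, a response of this kind would be compared with a learned threshold; practical implementations usually evaluate the rectangle sums with an integral image rather than the direct double loop used here.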
For the face detection routine, the images $x_i$ are
classified into face or non-face. While face images of
size $E \times w$ contain only the faces, hand images of
size $E \times w$ include the backgrounds as well as the hands.
Examples of face and hand images are shown in Fig. 4. It
can be seen that a large number of pixels in a hand image
correspond to the background. The background regions
can affect the performance of the detectors. If features
that are strongly affected by the background are selected in the
training phase, the performance of the detectors can be
severely degraded when they operate against a cluttered
background.
The influence of background on the detector
performance can be alleviated by excluding the features
that are easily affected by the backgrounds [13]. In the
training phases of the hand detection routines, we
eliminate such features in advance to reduce the influence
of the background. Let $\bar{x}$ be the average of the images
$x_i$ for $i = 1, 2, \ldots, N$. A mask is obtained by taking the
region where the average pixel values are greater than a
threshold $T_A$:
$$
p(m,n) =
\begin{cases}
1, & \text{if } \bar{x}(m,n) > T_A, \\
0, & \text{otherwise.}
\end{cases}
\tag{2}
$$
S. Kim et al.: Vision-based Cleaning Area Control for Cleaning Robots 687
Fig. 4. Training examples of face and hand. A rectangular window
can contain the face region alone, without background, whereas hand
poses cannot be contained in a rectangular window without including
background.
Fig. 5. Average hand images of training examples of six hand poses and
corresponding mask images.
Fig. 6. Examples of selected features in the strong classifiers, (a) trained
with reduced feature set, and (b) trained with full feature set.
Two thousand hand images of the same size without
background are used for each hand posture to obtain the mask
image. The average hand images and the corresponding mask
images are given in Fig. 5. The overlapping ratio between the
mask and the $j$th feature template $\phi_j$, denoted $\gamma_j$,
is calculated by
$$
\gamma_j = \frac{1}{wE} \sum_{(m,n) \in B} |\phi_j(m,n)|\, p(m,n). \tag{3}
$$
A set of features inside the average hand regions is obtained by
$$
F_M = \{\, \phi_j \mid \phi_j \in F_0 \text{ and } \gamma_j > T_M \,\}. \tag{4}
$$
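The mask construction and feature selection in Eqs. (2)-(4) can be sketched as follows. This is a minimal illustration under stated assumptions: images and templates are equally sized nested lists, the overlap ratio is normalized by the window area exactly as Eq. (3) is printed, and the function names, the toy 4 x 4 images, and the toy threshold `t_m = 0.2` are hypothetical (the paper uses $T_A = 30$ and $T_M = 0.99$ on its own data).

```python
def average_image(images):
    """Pixel-wise mean of equally sized images (the average image in Eq. (2))."""
    count = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [
        [sum(img[m][n] for img in images) / count for n in range(cols)]
        for m in range(rows)
    ]


def mask_from_average(avg, t_a):
    """Eq. (2): 1 where the average pixel value exceeds t_a, 0 otherwise."""
    return [[1 if value > t_a else 0 for value in row] for row in avg]


def overlap_ratio(template, mask):
    """Eq. (3): sum of |template| * mask, divided by the window area."""
    rows, cols = len(mask), len(mask[0])
    total = sum(
        abs(template[m][n]) * mask[m][n] for m in range(rows) for n in range(cols)
    )
    return total / (rows * cols)


def select_features(templates, mask, t_m):
    """Eq. (4): keep the templates whose overlap ratio exceeds t_m."""
    return [t for t in templates if overlap_ratio(t, mask) > t_m]


# Two toy background-free "hand" images (assumed values).
images = [
    [[0, 0, 0, 0], [0, 60, 60, 0], [0, 60, 60, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 40, 40, 0], [0, 40, 40, 0], [0, 20, 0, 0]],
]
mask = mask_from_average(average_image(images), t_a=30)

# One template inside the masked region, one straddling its left boundary.
inside = [[0, 0, 0, 0], [0, 1, -1, 0], [0, 1, -1, 0], [0, 0, 0, 0]]
boundary = [[0, 0, 0, 0], [1, -1, 0, 0], [1, -1, 0, 0], [0, 0, 0, 0]]

selected = select_features([inside, boundary], mask, t_m=0.2)
```

With these toy values the template inside the masked region is kept, while the one that reaches into the background is discarded.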
Example images $(x_i, y_i)$ for $i = 1, 2, \ldots, N$ are prepared for
the training of the face and hand-pose detectors, where $y_i$ is
zero and one for the negative and positive example images,
respectively. The size of the images is $32 \times 32$. Two thousand
positive images including various backgrounds and eight
thousand negative images are prepared for each detector.
The detectors are trained with the algorithm in [12], except
that the set $F_M$ is used in the training phase instead of the
original feature set $F_0$. The thresholds $T_A$ and $T_M$ are 30
and 0.99, respectively. The number of cascade stages in each
detector is set to nine.
Fig. 6 shows examples of the selected features used in
the strong classifiers. Some of the features selected when
training with the original feature set $F_0$ are located at
the boundaries between the hand and background regions. These
features provide significantly different outputs as
the background changes. In contrast, the features selected when
training with the reduced feature set $F_M$ are all inside
the hand region, so the pixels in the background do not affect
the outcome of the classifiers.
A. Hand Detectors with Pose Estimation
In the proposed method, we employ six hand-pose
detectors. In the detection phase, patches of images are
exhaustively classified as hand or no-hand. The computational
complexity of the six independently run hand-pose detectors is
six times that of a single detector. A method to reduce the
computational complexity is to utilize the confidence level
[14]. Let $k = 1, 2, \ldots, 6$ be the index of the six hand-pose
detectors, and $l = 1, 2, \ldots, 9$ be the index of the nine cascade
stages. The confidence level of the $l$th cascade stage of the $k$th
detector is defined by
$$
C_k^l(x_i) = \Bigl|\, \sum_j f_j(x_i) \Bigr|,
$$
where the sum is over all the selected features in the $l$th stage
of the $k$th detector. The confidence level of the $k$th detector
after the first $L$ cascade stages is obtained by
$$
C_k^{1:L}(x_i) = \sum_{l=1}^{L} C_k^l(x_i).
$$
The hand pose is estimated by finding the pose with the
maximum confidence level:
$$
k^* = \arg\max_k C_k^{1:L}(x_i).
$$
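The pose estimation step can be sketched as below. The sketch assumes the per-stage confidence levels of each detector have already been computed and are passed in as plain lists. The function names and the toy confidence values are hypothetical, and the code uses 0-based indices where the text uses $k = 1, \ldots, 6$.

```python
def cumulative_confidence(stage_confidences, num_stages):
    """Sum of the per-stage confidence levels over the first num_stages stages."""
    return sum(stage_confidences[:num_stages])


def estimate_pose(confidences_per_detector, num_stages):
    """Index of the detector with the largest cumulative confidence (arg max)."""
    totals = [
        cumulative_confidence(stages, num_stages)
        for stages in confidences_per_detector
    ]
    return max(range(len(totals)), key=totals.__getitem__)


# Toy confidence levels for six detectors over their first three stages.
confidences = [
    [0.2, 0.1, 0.3],
    [0.9, 0.8, 0.7],
    [0.1, 0.4, 0.2],
    [0.5, 0.5, 0.5],
    [0.3, 0.2, 0.1],
    [0.6, 0.1, 0.4],
]

best = estimate_pose(confidences, num_stages=3)
```

Here the second detector (index 1) has the largest cumulative confidence, so it alone would go on to evaluate its remaining stages.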
Once the hand pose is obtained, only the $k^*$th detector
finishes the stages from the $(L + 1)$th to the 9th to classify the
input image $x_i$,
are also shown. It can be seen that all the proposed hand
detectors outperform the original hand detectors. For given
detection rates greater than 0.9, the false alarm ratios of the
proposed detectors are smaller than those of the original
detectors. The miss, false positive, and hit rates of the
detectors tested on the test set are summarized in TABLE II.
The proposed detectors trained with the reduced feature set
provide higher hit rates than the original detectors for all the
hand postures.
B. Performance of Hand Detector with Pose Estimation
Two hundred of the thousand test images used for each of the
six hand-pose detector tests are selected and combined into a
single test image set. The performance of the hand detector
with pose estimation is evaluated using this test set. The miss,
false positive, and hit rates of the detectors tested on the test
set are summarized in TABLE III. For comparison, the six
hand-pose detectors are executed separately in parallel, and
the resulting miss, false positive, and hit rates are also
provided. The cases where the detectors are trained with the
original feature set are also given for comparison. It can be
seen that the hand detector with pose estimation provides a
higher hit rate than the individually run detectors. For the
detectors trained with the reduced feature set, the hit rate is
improved from 96.67% to 96.75%. For the detectors trained
with the original feature set, the hit rate is improved from
93.67% to 94.58%. The detectors trained with the reduced
feature set perform better than those trained with the
original feature set: the hit rate of the detector with pose
estimation is 96.75% when trained with the reduced feature set
and 94.58% when trained with the original feature set, and the
hit rate of the detectors without pose estimation is 96.67% when
trained with the reduced feature set and 93.67% when trained
with the original feature set.
We can conclude that i) the use of the reduced feature set in
the training phase makes the detectors robust to the influence
of the cluttered background, and ii) the use of the first three
stages of the cascade for pose estimation improves the
performance of the detector. Note that the computational
complexity of the detector with pose estimation is
considerably smaller because the remaining six stages of the
cascade are performed only for the estimated pose.
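The reduction in complexity can be made concrete by counting cascade-stage evaluations for a single image patch. The count below is a rough upper bound that assumes the patch passes every stage and that all stages cost the same, which the paper does not state; the constant and function names are hypothetical.

```python
NUM_DETECTORS = 6  # six hand-pose detectors
NUM_STAGES = 9     # nine cascade stages per detector
POSE_STAGES = 3    # stages shared by all detectors for pose estimation


def stages_independent(num_detectors, num_stages):
    """Stage evaluations when every detector runs its full cascade."""
    return num_detectors * num_stages


def stages_with_pose_estimation(num_detectors, num_stages, pose_stages):
    """All detectors run the first pose_stages stages; one detector finishes."""
    return num_detectors * pose_stages + (num_stages - pose_stages)


independent = stages_independent(NUM_DETECTORS, NUM_STAGES)
with_pose = stages_with_pose_estimation(NUM_DETECTORS, NUM_STAGES, POSE_STAGES)
```

Under these assumptions, a patch that survives all stages needs 54 stage evaluations when the detectors run independently and 24 with pose estimation.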
TABLE III
MISS, FALSE POSITIVE, AND HIT RATES OF DETECTORS WITH AND WITHOUT POSE
ESTIMATION FOR A CLUTTERED BACKGROUND TEST SET

                      | Hand detectors with     | Hand detectors without
                      | pose estimation         | pose estimation
Feature set type      | miss  false  hit rate   | miss  false  hit rate
----------------------+-------------------------+-----------------------
Reduced feature set   | 39    1821   96.75%     | 40    2528   96.67%
Original feature set  | 65    3245   94.58%     | 76    5264   93.67%
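The hit rates in TABLE III are consistent with the miss counts if the combined test set holds 6 x 200 = 1200 images and the hit rate is taken as one minus the miss ratio. This relation is an assumption inferred from the numbers and is not stated explicitly in the text. A short check, with hypothetical names:

```python
TOTAL_TEST_IMAGES = 6 * 200  # 200 images from each of the six pose test sets


def hit_rate_percent(misses, total=TOTAL_TEST_IMAGES):
    """Hit rate in percent, assuming hit rate = (total - misses) / total."""
    return round((total - misses) / total * 100, 2)


# Miss counts copied from TABLE III.
miss_counts = {
    ("reduced", "with pose estimation"): 39,
    ("reduced", "without pose estimation"): 40,
    ("original", "with pose estimation"): 65,
    ("original", "without pose estimation"): 76,
}

hit_rates = {key: hit_rate_percent(m) for key, m in miss_counts.items()}
```

The computed values reproduce the four hit rates listed in the table: 96.75%, 96.67%, 94.58%, and 93.67%.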
C. Cleaning Area Control
Fig. 9 shows examples of real-world video sequences
and the determined cleaning areas. One frame of each video
sequence is shown with the detected face and hand indicated
by colored circles. The location of the user is shown on a floor
map with a red rectangle. From the detected hand pose, the
cleaning area the user points to is determined, and the robot
moves to the cleaning area. For example, Fig. 9 (a) is the
case in which the robot moves to a room located on its left-hand
side. Fig. 9 (b) is the case in which the robot moves to its
left-hand side but does not enter the room. The determined
cleaning areas and the robot's navigation paths are marked on the
floor maps.
Fig. 9. Test of the proposed cleaning area control system in a real
situation. (a) altitude 0°, azimuth 45°; (b) altitude 45°, azimuth 45°;
(c) altitude 0°, azimuth 90°; (d) altitude 40°, azimuth 135°.
IV. CONCLUSION
This paper provides a vision-based HCI method for a
cleaning robot to navigate to a specific location in a home
environment. The proposed method is based on six
AdaBoost-based hand-pose detectors. The detectors are
trained with the reduced Haar-like feature sets. The first three
stages are used for pose estimation. Experiments with test
image sets show that the use of the reduced feature set and the
pose estimation improves the performance of the detector in a
cluttered background environment. The cleaning area control
is simulated with real-world video sequences. The proposed
method can effectively control a cleaning robot without the
need for a user to wear or employ any input devices.
REFERENCES
[1] C. Chunlin, L. Han-Xiong, and D. Daoyi, "Hybrid control for robot
navigation - a hierarchical Q-learning algorithm," IEEE Robotics and
Automation Magazine, vol. 15, no. 2, pp. 37-47, 2008.
[2] S. Saeedi, L. Paull, M. Trentini, and H. Li, "Neural network-based
multiple robot simultaneous localization and mapping," IEEE Trans.
Neural Networks, vol. 22, no. 12, pp. 2376-2387, 2011.
[3] Y. Xue and T. Xu, "An optimal and safe path planning for mobile robot
in home environment," in Advanced Research on Computer Science and
Information Engineering, Communications in Computer and
Information Science, G. Shen and X. Huang, Eds., pp. 442-447, Springer
Berlin Heidelberg, 2011.
690 IEEE Transactions on Consumer Electronics, Vol. 58, No. 2, May 2012
[4] Y. Chai, S. Shin, K. Chang, and T. Kim, "Real-time user interface using
particle filter with integral histogram," IEEE Trans. Consumer
Electronics, vol. 56, no. 2, pp. 510-515, 2010.
[5] D. Lee and Y. Park, "Vision-based remote control system by motion
detection and open finger counting," IEEE Trans. Consumer Electronics,
vol. 55, no. 4, pp. 2308-2313, 2009.
[6] S. Y. Cheng and M. M. Trivedi, "Vision-based infotainment user
determination by hand recognition for driver assistance," IEEE Trans.
Intelligent Transportation Systems, vol. 11, no. 3, pp. 759-764, 2010.
[7] Y. V. Parkale, "Gesture based operating system control," 2nd Intl. Conf.
Advanced Computing & Communication Technologies, pp. 318-323,
January 2012.
[8] N. H. Dardas and N. D. Georganas, "Real-time hand gesture detection
and recognition using bag-of-features and support vector machine
techniques," IEEE Trans. Instrumentation and Measurement, vol. 60, no.
11, pp. 3592-3607, 2011.
[9] C. Garcia and M. Delakis, "Convolutional face finder: a neural
architecture for fast and robust face detection," IEEE Trans. Pattern
Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1408-1423, 2004.
[10] X. Shipeng and P. Jing, "Hand detection using robust color correction
and Gaussian mixture model," 6th Intl. Conf. Image and Graphics, pp.
553-557, August 2011.
[11] J. Guo, Y. Liu, C. Chang, and H. Nguyen, "Improved hand tracking
system," IEEE Trans. Circuits and Systems for Video Technology, vol.
PP, no. 99, pp. 1-1, 2011.
[12] P. Viola and M. Jones, "Rapid object detection using a boosted cascade
of simple features," Proc. 2001 IEEE Computer Society Conf.
Computer Vision and Pattern Recognition, vol. 1, pp. I-511-I-518, 2001.
[13] S. Kim, J. Y. Sim, and S. Yang, "Background robust hand detection
using a reduced Haar-like feature set," submitted for publication in
Electronics Letters.
[14] W. Bo, A. Haizhou, H. Chang, and L. Shihong, "Fast rotation invariant
multi-view face detection based on real Adaboost," 6th IEEE Intl. Conf.
Automatic Face and Gesture Recognition, pp. 79-84, 2004.
[15] R. E. Schapire and Y. Singer, "Improved boosting algorithms using
confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp.
297-336, 1999.
BIOGRAPHIES
Soowoong Kim received the B.S. degree from Kumoh
National Institute of Technology, Gumi, Korea, in 2009.
He is currently a PhD candidate in the School of
Electrical and Computer Engineering at the Ulsan
National Institute of Science and Technology in Ulsan,
Korea. His research interests are in human-computer
interface and computer vision.
Jae-Young Sim (S'01-M'05) received the B.S. degree in
electrical engineering and the M.S. and Ph.D. degrees in
electrical engineering and computer science from Seoul
National University, Seoul, Korea, in 1999, 2001, and
2005, respectively. From 2005 to 2009, he was a
Research Staff Member, Samsung Advanced Institute of
Technology, Samsung Electronics Co., Ltd. In 2009, he
joined the School of Electrical and Computer
Engineering, Ulsan National Institute of Science and Technology (UNIST) as
an Assistant Professor. His research interests are in image and 3-D visual
signal processing, multimedia data compression, and computer vision.
Seungjoon Yang (S'09-M'00) received the B.S. degree
from Seoul National University, Seoul, Korea, in 1990,
and the M.S. and Ph.D. degrees from the University of
Wisconsin-Madison, in 1993 and 2000, respectively, all
in electrical engineering. He was with the Digital Media
R&D Center at Samsung Electronics Co., Ltd. from
September 2000 to August 2008. He is currently with the
School of Electrical and Computer Engineering at the
Ulsan National Institute of Science and Technology in
Ulsan, Korea. His research interests are in image processing, estimation
theory, and multi-rate systems.
Professor Yang received the Samsung Award for the Best Technology
Achievement of the Year in 2008 for his work in the premium digital
television platform project.