
ICOT 2015

Human Action Recognition System for Elderly and Children Care Using Three Stream ConvNet

Chang-Di Huang, Chien-Yao Wang, and Jia-Ching Wang


Department of Computer Science and Information Engineering
National Central University
Taoyuan, Taiwan, R.O.C.
fighting8139@gmail.com, x102432003@yahoo.com.tw, and jcw@csie.ncu.edu.tw

Abstract—Because of changes in family structure and population ageing, elderly and children care has become a very important issue in modern society. When adults are busy working, they have no time to care for elderly people and children who stay home alone. This paper proposes an elderly and children care system to address this problem. The proposed intelligent surveillance system is based on action recognition from video. A three-stream convolutional neural network is proposed to recognize human actions such as falling on the floor and baby crawling. If the system detects that an abnormal activity has occurred, it raises an alarm and notifies family members. In the experiments, a total of 21 activity categories are collected from the HMDB-51 dataset, the UCF-101 dataset and the Internet. The proposed system achieves a 93.42% recognition rate on the selected actions.

Keywords—action recognition; three-stream ConvNet; convolutional neural network; deep learning; spatial; temporal; moving

Figure 1. Example of Actions.
1. Introduction

In modern society, double-income families are common. When the adults are busy working, elderly and children care becomes very important. A hired caregiver cannot protect the elderly and children anywhere and anytime, and the cost of a caregiver is quite high. An IP camera, however, can monitor home security and capture motion information without such temporal and spatial limitations. Using video captured from an IP camera, action recognition techniques can be applied to elderly and children care by recognizing whether the elderly or children take dangerous actions. Fig. 1 shows some examples of safe actions, which are acceptable, and dangerous actions, which need to raise an alarm.

Image retrieval has been applied to video processing for years, and a large amount of research has gone into it. One branch is text-based image retrieval (TBIR) and the other is content-based image retrieval (CBIR), but their performance still has much room for improvement. Related problems include action recognition [3, 4] and activity understanding [5]. There is still a strong need for action recognition methods that can classify large-scale videos [6].

High-level image feature learning for action recognition has made great progress in the past few years. Convolutional networks (ConvNets) [8] extract image features in their convolutional layers, and these features are then transferred to learning tasks in the final fully connected layers [9, 10]. Features learned by a ConvNet represent images better than earlier techniques and improve the performance of image retrieval.

Four types of ConvNet architecture have been used for the action recognition task:
(1) A 2-dimensional convolutional network as feature extractor, with a Support Vector Machine (SVM) or Neural Network (NN) as classifier [1]. The proposed system is mainly based on this approach.
(2) A 3-dimensional convolutional network as feature extractor, with an SVM or NN as classifier [6, 11]. This framework uses a 3D convolutional kernel (time by height by width) to capture spatial-temporal features.
(3) A 2-dimensional convolutional network as feature extractor, with a Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM) as classifier [12]. By combining spatial and temporal features, this architecture can handle more complex video data.
(4) A 3-dimensional convolutional network as feature extractor with an RNN-LSTM classifier [13].

Using camera video to recognize human actions is computationally demanding. The two-stream ConvNet developed in [1] combines two different ConvNets: one ConvNet is trained for image classification, while the other is trained on dense video optical flow. The two separate recognition streams, a spatial stream and a temporal stream, are combined by late fusion. The spatial and temporal streams successfully decrease the number of learning parameters, and for this reason two-stream ConvNets overcome the computational demand.



Figure 2. Three-stream convolutional network architecture for action recognition.
In this work, we propose a three-stream ConvNet for video data. Besides the spatial stream and the temporal stream, the centroid of the detected human is used to build a moving-stream network. The system recognizes the action performed by the person in the video, for example walking, running, opening a window or hanging clothes. Once the recognized action is determined to be abnormal [2, 7], the system generates a corresponding response, such as sending a warning SMS to the householder, eventually building a smart living environment for elderly and children care with surveillance systems.

The rest of the paper is organized as follows. In Section 2 we give an overview of our three-stream ConvNet system and clarify the size of each layer. In Section 3 we first present the function of each layer in a ConvNet, then introduce the three-stream architecture, and finally specify the spatial and temporal ConvNets. In Section 4 we introduce the two action recognition datasets and show which classes we select for home care and smart living. Experimental results, including the evaluated recognition rates, are given in Section 5, and the conclusion is given in Section 6.
2. System Overview

The framework of the proposed system is shown in Fig. 2. First, the deformable part model (DPM) [18] is used to extract human regions from the video. The extracted human regions are resized to 96x48 RGB images. The input of the spatial-stream convolutional network is 96x48@6 RGB images, the input of the temporal-stream convolutional network is 96x48@6 optical flow images, and the input of the moving-stream network is the motion vectors of the detected human region centroids. Each proposed convolutional network has 2 convolutional layers, 1 pooling layer and 2 fully connected layers, followed by a softmax classifier as the output layer. All convolutional kernels are 3x3@64, the pooling layer is 2x2, and all fully connected layers have 256 output units. A softmax fully connected layer is used to fuse the results of the three-stream convolutional network.
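To make the layer sizes concrete, the following is a minimal Python (PyTorch) sketch of one stream's network; it is our own reading, not code from the paper. Unspecified details are assumptions: the six RGB frames are stacked along the channel axis (18 input channels), convolutions use padding 1, ReLU activations are used, and the softmax itself is applied by the loss or by the fusion stage.

import torch
import torch.nn as nn

class StreamConvNet(nn.Module):
    """One stream: two 3x3@64 conv layers, one 2x2 max-pooling layer,
    two 256-unit fully connected layers, and a class-score output."""

    def __init__(self, in_channels=18, num_classes=21):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),      # 96x48 -> 48x24
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 48 * 24, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),      # softmax applied at fusion
        )

    def forward(self, x):                     # x: (batch, channels, 96, 48)
        return self.classifier(self.features(x))

The temporal stream would use the same layout with a different channel count, e.g. StreamConvNet(in_channels=12) if the six flow images each carry horizontal and vertical displacement channels (again an assumption).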
3. Proposed Method

This section contains two parts: the first part is an introduction to convolutional networks, and the second part gives the details of the proposed three-stream convolutional network.

3.1 Convolutional Network

A ConvNet [19] is essentially a kind of input-to-output mapping, and the purpose of training is to learn this mapping between input and output. The model of a ConvNet is mainly composed of a number of convolutional layers and pooling layers; finally, the high-level features generated by the fully connected layers are used as the basis for identifying the input image.

The convolutional layer applies convolutional kernels to the input image and conveys the resulting image features to the next layer as feature maps; thus the convolutional layer is simply an image convolution of the previous layer.

After each convolutional layer, there may be a pooling layer. The pooling layer uses the local correlation principle to subsample images; it decreases the pixel dimensions while retaining useful characteristic information. As shown in Fig. 3, there are two ways to subsample: average pooling and max pooling. In this work, we always use the max-pooling layer, which takes the maximum of each pooled block.
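As a toy illustration of the two subsampling schemes, the snippet below applies 2x2 max pooling and average pooling to an arbitrary 4x4 feature map (the values are made up for the example):

import numpy as np

# An arbitrary 4x4 feature map.
x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 2],
              [2, 2, 3, 4]], dtype=float)

# Group the map into non-overlapping 2x2 blocks and pool each block.
blocks = x.reshape(2, 2, 2, 2)
print(blocks.max(axis=(1, 3)))   # max pooling:     [[4. 2.] [2. 5.]]
print(blocks.mean(axis=(1, 3)))  # average pooling: [[2.5 1.] [1.25 3.5]]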

Figure 3. Pooling operation.

After several convolutional layers and max-pooling layers, the high-level deep learning result is produced by the fully connected layers. Fully connected layers can be visualized as one-dimensional, since their units are no longer spatially located, and naturally no convolutional layer follows a fully connected layer.

3.2 Three-Stream Convolutional Network

The three-stream ConvNet, inspired by the two-stream ConvNet [1], separates video naturally into a temporal stream and a spatial stream. The spatial stream includes the information about frames and object scenes in the video. The temporal stream, formed by motion, carries the movement of the camera or other observers. The frames of these two streams are cropped and realigned using the extracted human regions and fed into deep convolutional neural networks, so that they can be learned and tested separately. During human region extraction, the moving stream is extracted from the movement of the person of interest and becomes the input of a multi-layer perceptron. After deep learning, these three streams are finally combined with the hinge loss classifier.
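A minimal sketch of the late-fusion step follows, assuming the three streams' class scores are concatenated and passed through a final fully connected layer with a softmax output. Note that Section 2 describes a softmax fusion layer while this section mentions a hinge loss classifier, so this is one plausible reading rather than the paper's exact method.

import torch
import torch.nn as nn

class ThreeStreamFusion(nn.Module):
    def __init__(self, spatial, temporal, moving, num_classes=21):
        super().__init__()
        self.spatial, self.temporal, self.moving = spatial, temporal, moving
        # Final fully connected fusion layer over the concatenated scores.
        self.fusion = nn.Linear(3 * num_classes, num_classes)

    def forward(self, rgb, flow, centroids):
        scores = torch.cat([self.spatial(rgb),
                            self.temporal(flow),
                            self.moving(centroids)], dim=1)
        return torch.softmax(self.fusion(scores), dim=1)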
Movement Neural Network. Some actions have similar pose changes but different speeds or directions, for example walking versus running and push versus pull; the moving stream can be used for these kinds of actions. In addition, some actions are safe when the movement is slow but dangerous when the movement is quick, and the moving-stream network can also detect these warning cases. The input of the moving stream is the centroid of the detected human regions.
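The following helper illustrates how the moving-stream input could be formed from the detector output. The (x, y, w, h) bounding box format and the frame-to-frame differencing are our assumptions; the paper only states that the input is the motion vectors of the human region centroids.

import numpy as np

def centroid_motion_vectors(boxes):
    """boxes: (num_frames, 4) array of (x, y, w, h) human detections."""
    cx = boxes[:, 0] + boxes[:, 2] / 2.0   # centroid x per frame
    cy = boxes[:, 1] + boxes[:, 3] / 2.0   # centroid y per frame
    centroids = np.stack([cx, cy], axis=1)
    # Frame-to-frame displacements of the centroid: (num_frames - 1, 2).
    return np.diff(centroids, axis=0)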
Spatial Convolutional Network. The input of the proposed spatial stream is a sequence of consecutive frames, whereas in [1] it is an individual frame. The selection of frames in the spatial stream is a very important element, because most of the videos are associated with particular motions. [14] uses a three-dimensional convolutional neural network for human action recognition and shows that six frames are sufficient for the action recognition task. In our framework, the number of input frames is also set to six.
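As one way to assemble the 96x48@6 spatial input described in Section 2, the sketch below uniformly samples six cropped human regions and stacks them along the channel axis; the uniform sampling and channel-wise stacking are our assumptions, not details given in the paper.

import cv2
import numpy as np

def spatial_input(regions, num_frames=6, size=(48, 96)):
    """regions: list of cropped RGB human regions, each (H, W, 3) uint8."""
    # Uniformly sample six frames across the clip (assumption).
    idx = np.linspace(0, len(regions) - 1, num_frames).astype(int)
    # cv2.resize takes (width, height), giving 96x48 images.
    resized = [cv2.resize(regions[i], size) for i in idx]
    # Stack along the channel axis: (96, 48, 6*3).
    return np.concatenate(resized, axis=2)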
Temporal Convolutional Network. In this work, the temporal ConvNet is evaluated on optical flow extracted by the Lucas-Kanade method. Optical flow estimation quantifies the movement of objects across multiple frames in computer vision. It often works at multiple image scales, evaluating in an energy framework. A common optical flow method solves for a displacement field using an eight-parameter model and approximates image neighborhoods with polynomials [1, 15]. Another method defines the energy based on intensity constancy assumptions [16]. Unlike the spatial ConvNet, the input of the temporal stream is composed of optical flow over consecutive frames, which contains a dense field of displacements. Since the network does not need to estimate the motion itself, this kind of input directly depicts the movement among video frames, making action recognition more accurate.
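The snippet below sketches how the stacked flow input could be computed with OpenCV. We use the Farneback polynomial-expansion method [15] as a readily available dense-flow stand-in; the paper's system extracts flow with the Lucas-Kanade method, so this is illustrative rather than the exact pipeline.

import cv2
import numpy as np

def flow_stack(gray_frames):
    """gray_frames: list of consecutive grayscale frames, each (H, W) uint8."""
    flows = []
    for prev, curr in zip(gray_frames, gray_frames[1:]):
        # Dense flow via polynomial expansion (Farneback [15]); each result
        # holds per-pixel (dx, dy) displacements.
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    # Stack the displacement fields along the channel axis.
    return np.concatenate(flows, axis=2)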

4. Experimental Setup

In our experiments, training is carried out on videos from the UCF-101 and HMDB-51 datasets. We introduce these two datasets below and list the selected video categories.

UCF-101. UCF-101 is an action video dataset collected mainly from YouTube. Its predecessor, UCF-50, contains 50 action categories; UCF-101 now has 101 action categories, extending the training task for action recognition. The videos in the UCF-101 dataset are gathered into 25 groups, and videos in the same group may share similar features, such as an analogous background or visual angle. The UCF-101 video categories are grouped into five types [17]:
(1) Body-Motion Only
(2) Human-Human Interaction
(3) Human-Object Interaction
(4) Playing Musical Instruments
(5) Sports
From these five types we select several kinds of videos related to elderly and children care for smart living, including BabyCrawling, CuttingInKitchen, MoppingFloor, Swing, TrampolineJumping and WalkingWithDog.

HMDB-51. HMDB is mostly collected from movies, with a small part coming from Internet sources, including YouTube, the Prelinger archive and Google. HMDB holds about six thousand clips divided into 51 action categories, which can be grouped into five types [18]:
(1) Body movements with human interaction
(2) Body movements with object interaction
(3) Facial actions with object manipulation
(4) General body movements
(5) General facial actions
We select nine categories of videos matching the research topic of elderly and children care: ClimbStairs, FallFloor, Hit, Pick, Push, Run, Smoke, Walk and Wave.

We also collect some videos of activities such as Sit, Stand and Jogging from the Internet. In total, 21 categories of activities are selected for the experiments.

5. Experimental Results

In this work, our goal is to build a smart living environment for elderly and children care. The selected action categories are generalized into two types: indoor activities and outdoor activities. An activity is assigned to the outdoor type if it may occur both indoors and outdoors. In the experimental results, two types of single-stream ConvNet and a two-stream ConvNet are selected as baseline systems.
Indoor Activities. For the elderly and children, some behaviors are dangerous at home. Crawling babies may be hit by falling objects when they knock against a table, and behaviors such as cutting with a knife and mopping the floor still carry a high risk for the elderly. To sum up, BabyCrawling, CuttingInKitchen and MoppingFloor are categorized as indoor activities.

Outdoor Activities. Amusement facilities in the yard sometimes cause children to fall and get hurt, so a smart living system needs to monitor them. Therefore, two classes of action recognition videos, Swing and TrampolineJumping, are put in the outdoor type. Some outdoor activities are static, such as Pick, Smoke, WalkingWithDog, Push, Walk and Wave; the others are relatively dynamic, including ClimbStairs, FallFloor, Hit and Run.
TABLE I
RESULTS COMPARISON

ConvNet                 Inputs                        Recognition Rate
Single Stream ConvNet   Spatial                       93.19%
Single Stream ConvNet   Temporal                      92.95%
Two Stream ConvNet      Spatial, Temporal             93.27%
Three Stream ConvNet    Spatial, Temporal, Movement   93.42%
Table I presents the recognition rates of the baseline systems and the proposed system. The recognition rates of the single-stream ConvNets are 93.19% and 92.95% when using the spatial stream input and the temporal stream input, respectively. After combining these two kinds of inputs, the two-stream ConvNet performs better than either single input, reaching a 93.27% recognition rate on the selected 21 activities. The proposed system obtains better recognition rates than all baseline systems: its recognition rate is 93.42%, an improvement of 0.23%, 0.47% and 0.15% over the spatial-stream ConvNet, the temporal-stream ConvNet and the two-stream ConvNet, respectively.

Fig. 4 shows the convergence of the baseline systems and the proposed system. The green line is the objective value of the spatial-stream ConvNet, the cyan line is for the temporal-stream ConvNet, the blue line is for the two-stream ConvNet, and the red line is for the proposed three-stream ConvNet. The results show that the proposed system converges to the best value.

Figure 4. Objective (energy versus training epoch for the spatial, temporal, two-stream and three-stream ConvNets).

6. Conclusion

In this work, an elderly and children care system based on action recognition via deep convolutional neural networks is proposed. To recognize human actions, a novel ConvNet architecture called the three-stream ConvNet is proposed. The three-stream ConvNet considers spatial information, temporal information and the movement of the human to classify human activities. The proposed system can recognize more than 20 categories of human daily activities and detect whether an activity is normal or abnormal. The recognition rate achieves 93.42%.
REFERENCES

[1] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014; arXiv:1406.2199v2 [cs.CV].
[2] O. Boiman and M. Irani, "Detecting irregularities in images and in video," IJCV, 2007.
[3] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in ICCV, pp. 1395–1402, 2005.
[4] I. Laptev and T. Lindeberg, "Space-time interest points," in ICCV, 2003.
[5] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, "Activity forecasting," in ECCV, 2012.
[6] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in CVPR, 2014.
[7] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
[8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[9] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, "PANDA: Pose aligned networks for deep attribute modeling," in CVPR, 2014.
[10] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in NIPS, 2014.
[11] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," arXiv:1412.0767v3 [cs.CV], 2015.
[12] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," arXiv:1411.4389 [cs.CV], 2015.
[13] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, "Sequential deep learning for human action recognition," in Human Behavior Understanding, 2011.
[14] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, Jan. 2013.
[15] G. Farnebäck, "Two-frame motion estimation based on polynomial expansion," in SCIA, pp. 363–370, 2003.
[16] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High accuracy optical flow estimation based on a theory for warping," in Proc. ECCV, pp. 25–36, 2004.
[17] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," CoRR, abs/1212.0402, 2012.

[18] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, Sep. 2010.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
