Figure 3. Pooling operation.

max-pooling layer, which takes the maximum of the block that it pools.
After several convolutional layers and max-pooling layers, the high-level deep learning result is produced by the fully-connected layer. Fully-connected layers can be visualized as one-dimensional, since their units are no longer spatially located. Naturally, there are no convolutional layers after the fully-connected layers.
3.2 Three-Stream Convolutional Network
In three-stream ConvNets, which are inspired by two-stream ConvNets [1], video can be separated naturally into a temporal stream and a spatial stream. The spatial stream includes the information about frames and object scenes in the video. On the other hand, the temporal stream, formed by motion, carries the movement of the camera or other observers. Frames of these two streams are cropped and realigned by extracted human regions and fed into the deep convolutional neural network, so they can be learned and tested separately. During the human region extraction, the moving stream can be extracted from the movement of the human of interest and used as the input of a multi-layer perceptron. After deep learning, these three streams are finally combined with a hinge loss classifier.
Movement Neural Network. Some actions have similar pose changes but different speed or direction, for example walking versus running and push versus pull. The moving stream can be used for these kinds of actions. On the other hand, some actions are safe when the movement is slow but dangerous when the movement is quick, and the moving-stream network can also detect these warnings. The input of the moving stream is the centroid of the detected human regions.
Spatial Convolutional Network. The input of the proposed spatial stream is a sequence of serial frames, while it is an individual frame in [1]. The selection of frames in the spatial stream is a very important element, because most of the videos are associated with particular motions. [14] uses a three-dimensional convolutional neural network for human action recognition and shows that six frames are sufficient for the action recognition task. In our framework, the number of input frames is also set to six.
Temporal Convolutional Network. In this work, the temporal ConvNet is evaluated on optical flow extracted by the Lucas-Kanade method. Optical flow estimation quantifies and evaluates the movement of objects over multiple frames in the field of computer vision. It often works at multiple image scales, evaluating in an energy framework. A common optical flow method for solving a displacement field is to use eight-parameter models and approximate image neighborhoods with polynomials [1, 15]. Another method sets the energy based on constancy assumptions for intensity [16]. Unlike the spatial ConvNet, the input to the temporal stream is the optical flow of consecutive video frames, which contains hundreds of millions of displacement vectors. Since we do not need to estimate the motion directly, this kind of input depicts the movement among the video frames, making the action recognition more accurate.
4. Experimental Setup
In our experiment, video training is carried out on the UCF-101 and HMDB-51 datasets. We introduce these two datasets and list the video categories respectively.
UCF-101. UCF-101 is an action video dataset collected mainly from YouTube. Its predecessor, UCF-50, contains 50 action categories; UCF-101 extends this to 101 action categories, enlarging the training task for action recognition. Videos in the UCF-101 dataset are gathered into 25 groups, and videos in the same group may share similar features, such as analogous backgrounds or analogous visual angles. The UCF-101 video categories are grouped into five types [17]:
(1) Body-Motion Only
(2) Human-Human Interaction
(3) Human-Object Interaction
(4) Playing Musical Instruments
(5) Sports
From these five types we select several kinds of videos about elderly and children care for smart living, including BabyCrawling, CuttingInKitchen, MoppingFloor, Swing, TrampolineJumping and WalkingWithDog.
HMDB-51. HMDB is mostly collected from movies, with a small part coming from Internet databases, including YouTube, the Prelinger archive and Google. HMDB holds about six thousand clips divided into 51 action categories, which can be grouped into five types [18]:
(1) Body movements with human interaction
(2) Body movements with object interaction
(3) Facial actions with object manipulation
(4) General body movements
(5) General facial actions
We select nine categories of videos matching the research topic of elderly and children care: ClimbStairs, FallFloor, Hit, Pick, Push, Run, Smoke, Walk and Wave.
We also collect videos of activities such as Sit, Stand and Jogging from the Internet. In total, 21 categories of activities are selected for the experiments.
5. Experimental Results
In this work, our topic is to build smart living for elderly and children care. The action categories we selected are generalized into two types: indoor activities and outdoor activities. An activity is assigned to the outdoor type if it may occur both indoors and outdoors. In the experimental results, the two types of single-stream ConvNets and the two-stream ConvNet are selected as baseline systems.
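The max-pooling operation described earlier (Figure 3) keeps the maximum of each block it pools. A minimal NumPy sketch; the 2×2 window and the toy feature map are illustrative assumptions, since the paper does not state the pooling size:

```python
import numpy as np

def max_pool_2x2(x):
    """Max-pool a 2D feature map with non-overlapping 2x2 blocks,
    keeping the maximum value of each block."""
    h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0, "feature map sides must be even"
    # Group rows and columns into 2x2 blocks, then reduce each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 5],
                        [0, 1, 3, 2],
                        [2, 6, 0, 1]])
pooled = max_pool_2x2(feature_map)  # [[4, 5], [6, 3]]
```

Each 2×2 block collapses to a single value, halving both spatial dimensions while keeping the strongest activation.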
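The spatial stream takes six input frames, following the finding of [14]. A sketch of one way to pick them; uniform sampling is our assumption, since the paper does not state the selection rule:

```python
def sample_frame_indices(num_frames, num_samples=6):
    """Pick `num_samples` evenly spaced frame indices from a clip."""
    if num_frames < num_samples:
        raise ValueError("clip is shorter than the required input")
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

# A 30-frame clip yields indices [0, 5, 10, 15, 20, 25].
indices = sample_frame_indices(30)
```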
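The movement stream described earlier takes the centroid of the detected human regions as input. A hedged sketch, assuming the detector returns axis-aligned boxes `(x1, y1, x2, y2)` (a representation the paper does not specify); the centroid trajectory across frames exposes exactly the speed and direction cues the text mentions:

```python
def region_centroid(box):
    """Centroid of an axis-aligned bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

# Hypothetical track of one person's boxes over consecutive frames.
boxes = [(10, 20, 50, 120), (14, 20, 54, 120), (22, 21, 62, 121)]
trajectory = [region_centroid(b) for b in boxes]
# Frame-to-frame differences of the centroid encode speed and
# direction (e.g. walking vs. running, push vs. pull).
velocities = [(bx - ax, by - ay)
              for (ax, ay), (bx, by) in zip(trajectory, trajectory[1:])]
```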
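The temporal stream's input comes from Lucas-Kanade optical flow. A single-window, single-scale sketch of the core least-squares step (real pipelines add image pyramids and per-pixel windows; this toy version only shows the estimation itself):

```python
import numpy as np

def lucas_kanade_window(prev, curr):
    """One Lucas-Kanade step: solve the least-squares system built
    from spatial gradients (Ix, Iy) and the temporal difference (It)
    for a single displacement (u, v) shared by the whole window."""
    ix = np.gradient(prev, axis=1)   # horizontal image gradient
    iy = np.gradient(prev, axis=0)   # vertical image gradient
    it = curr - prev                 # temporal difference
    a = np.stack([ix.ravel(), iy.ravel()], axis=1)
    b = -it.ravel()
    d, *_ = np.linalg.lstsq(a, b, rcond=None)
    return d  # estimated (u, v)

# Synthetic textured window shifted right by one pixel.
ys, xs = np.mgrid[0:16, 0:16].astype(float)
prev = xs ** 2 + ys ** 2
curr = (xs - 1.0) ** 2 + ys ** 2   # same pattern, moved right by 1
u, v = lucas_kanade_window(prev, curr)  # u close to 1, v close to 0
```

The recovered displacement is approximate because the brightness-constancy equation is linearized, which is also why practical implementations work at multiple image scales, as the text notes.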
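The three stream outputs are finally combined with a hinge loss classifier. The paper does not detail the fusion, so this sketch assumes simple averaging of per-class scores followed by a multiclass (Crammer-Singer style) hinge loss; both the averaging and the example scores are our assumptions:

```python
import numpy as np

def multiclass_hinge_loss(scores, true_class, margin=1.0):
    """Penalize every wrong class whose score comes within `margin`
    of the true class's score (Crammer-Singer style hinge loss)."""
    gaps = np.maximum(0.0, scores - scores[true_class] + margin)
    gaps[true_class] = 0.0  # the true class itself is not penalized
    return float(gaps.sum())

# Hypothetical per-class scores from the three streams for one clip.
spatial = np.array([2.0, 0.5, 1.0])
temporal = np.array([1.5, 1.0, 0.2])
movement = np.array([2.5, 0.0, 0.3])
fused = (spatial + temporal + movement) / 3.0  # assumed late fusion
loss = multiclass_hinge_loss(fused, true_class=0)
```

Here the true class clears the margin against every other class, so the loss is zero; a score inside the margin would contribute its gap to the loss.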
Indoor Activities. For the elderly and children, some behaviors are dangerous at home. Crawling babies may be hit by falling objects when they knock against a table, and other behaviors, like cutting with a knife or mopping the floor, put the elderly at high risk. To sum up, BabyCrawling, CuttingInKitchen and MoppingFloor are summarized as indoor activities.
Outdoor Activities. Amusement facilities in the yard sometimes cause children to fall and get hurt, so if we want to build smart living, we need to monitor the amusement facilities. Therefore, two classes of action recognition videos, Swing and TrampolineJumping, are put into the outdoor type. Some outdoor activities are static, such as Pick, Smoke, WalkingWithDog, Push, Walk and Wave. The others are relatively dynamic, including the classes ClimbStairs, FallFloor, Hit and Run.

TABLE I
RESULTS COMPARISON
ConvNets                Inputs                          Results
Single Stream ConvNet   Spatial                         93.19%
Single Stream ConvNet   Temporal                        92.95%
Two Stream ConvNet      Spatial, Temporal               93.27%
Three Stream ConvNet    Spatial, Temporal, Movement     93.42%

Table I presents the recognition rates of the baseline systems and the proposed system. The recognition rates of the single-stream ConvNets are 93.19% and 92.95% when using the spatial stream input and the temporal stream input, respectively. After combining these two kinds of inputs, the two-stream ConvNet performs better than using only one kind of input, reaching a 93.27% recognition rate on the selected 21 activities. The proposed system achieves better recognition rates than all baseline systems: its recognition rate is 93.42%, an improvement of 0.23%, 0.47% and 0.15% over the spatial-stream ConvNet, the temporal-stream ConvNet and the two-stream ConvNet, respectively.

Fig. 4. Convergence of the baseline systems and the proposed system (curves: spatial, temporal, 2 stream, 3 stream; vertical axis: energy).

Fig. 4 shows the convergence of the baseline systems and the proposed system. The green line is the objective value of the spatial-stream ConvNet, the cyan line is for the temporal-stream ConvNet, the blue line is for the two-stream ConvNet, and the red line is for the proposed three-stream ConvNet. The results show that the proposed system converges to the best value.
6. Conclusion
In this work, an elderly and children care system based on action recognition via a deep convolutional neural network is proposed. To recognize human actions, a novel ConvNet architecture called the three-stream ConvNet is proposed. The three-stream ConvNet considers spatial information, temporal information and human movement to classify human activities. The proposed system can recognize more than 20 categories of human daily activities and detect whether an activity is normal or abnormal. The recognition rate reaches 93.42%.

REFERENCES
[1] K. Simonyan and A. Zisserman, “Two-Stream Convolutional Networks for Action Recognition in Videos,” in NIPS, 2014 (arXiv:1406.2199v2 [cs.CV]).
[2] O. Boiman and M. Irani, “Detecting irregularities in images and in video,” IJCV, 2007.
[3] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in ICCV, pages 1395–1402, 2005.
[4] I. Laptev and T. Lindeberg, “Space-time interest points,” in ICCV, 2003.
[5] K. M. Kitani, B. D. Ziebart, D. Bagnell, and M. Hebert, “Activity forecasting,” in ECCV, 2012.
[6] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale Video Classification with Convolutional Neural Networks,” in CVPR, 2014.
[7] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior recognition via sparse spatio-temporal features,” in International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
[8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
[9] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, “PANDA: Pose aligned networks for deep attribute modeling,” in CVPR, 2014.
[10] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, 2014.
[11] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning Spatiotemporal Features with 3D Convolutional Networks,” arXiv:1412.0767v3 [cs.CV], 2015.
[12] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term Recurrent Convolutional Networks for Visual Recognition and Description,” arXiv:1411.4389 [cs.CV], 2015.
[18] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object Detection with Discriminatively Trained Part Based Models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32, No. 9, Sep. 2010.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in NIPS, 2012.