Past lectures
Recognition in static images
Object recognition
Image categorisation
Today's lecture
Recognition in image sequences
Action recognition
(body gestures / facial expressions)
To come
Recognition in image sequences
Action recognition
(body gestures / facial expressions)
Tracking
Structure from motion
Surveillance
Today's lecture
Introduction application
Features
Template classification methods
Recognition using pose estimation and
objects (in brief)
Part-based action localisation
What is an action?
Poses
Nouns and Predicates
<man, swings, hammer>
<man, hits, nail, w/ hammer>
Applications
Human-computer interfaces
Augmented Reality
Sports Analysis
[ C. Sminchisescu, 2007 ]
Surveillance
http://users.isr.ist.utl.pt/~etienne/mypubs/Auvinetal06PETS.pdf
Interfaces
2011
1995
Held objects
Pose
Nearby objects
Today's lecture
Introduction application
Features
Template classification methods
Recognition using pose estimation and
objects
Part-based action localisation
Representing Motion
Optical Flow with Motion History
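One common way to build a motion history representation is a motion history image (MHI), where each pixel stores how recently it moved. The sketch below is a minimal illustration assuming grayscale frames and simple frame differencing as the motion detector; the function name `update_mhi` and the threshold values are hypothetical, not taken from the lecture.

```python
import numpy as np

def update_mhi(mhi, prev_frame, frame, timestamp, duration, diff_threshold=30):
    """Update a motion history image with one new grayscale frame.

    Pixels that changed are stamped with the current timestamp;
    pixels whose last motion is older than `duration` are reset to 0.
    """
    # Assumption: motion is detected by thresholded frame differencing.
    diff = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32))
    motion = diff > diff_threshold
    mhi = mhi.copy()
    mhi[motion] = timestamp
    mhi[~motion & (mhi < timestamp - duration)] = 0
    return mhi

f0 = np.zeros((4, 4), dtype=np.uint8)
f1 = f0.copy()
f1[1, 1] = 255                                   # one pixel changes
mhi = update_mhi(np.zeros((4, 4)), f0, f1, timestamp=1.0, duration=2.0)
```

Brighter MHI values then correspond to more recent motion, so a single image summarises a short window of the sequence.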
Representing Motion
Optical Flow with Split Channels
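A plausible reading of "split channels" is half-wave rectification: the horizontal and vertical flow components are each split into a positive and a negative part, giving four non-negative channels that can be blurred and compared without opposite motions cancelling. The sketch below assumes a dense flow field stored as an (H, W, 2) array; the function name is hypothetical.

```python
import numpy as np

def split_flow_channels(flow):
    """Split a dense flow field (H, W, 2) into four non-negative channels.

    Channel order (an arbitrary choice for this sketch): Fx+, Fx-, Fy+, Fy-.
    """
    fx, fy = flow[..., 0], flow[..., 1]
    return np.stack(
        [np.maximum(fx, 0), np.maximum(-fx, 0),
         np.maximum(fy, 0), np.maximum(-fy, 0)],
        axis=-1,
    )

flow = np.array([[[1.5, -2.0]]])                 # a single pixel moving right and up
channels = split_flow_channels(flow)
```

The original components can be recovered as Fx = Fx+ - Fx- and Fy = Fy+ - Fy-.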
Representing Motion
Tracked Points
Representing Motion
Space-Time Interest Points
Corner detectors in space-time
Laptev 2005
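For reference, the space-time corner detector can be written as an extension of the Harris detector to three dimensions. The notation below is a sketch under assumptions: L is the video smoothed at spatial scale sigma and temporal scale tau, the subscripts denote partial derivatives, g is a Gaussian integration window, and k is a small constant.

```latex
\mu = g(\cdot\,;\, \sigma_i^2, \tau_i^2) *
\begin{pmatrix}
L_x^2   & L_x L_y & L_x L_t \\
L_x L_y & L_y^2   & L_y L_t \\
L_x L_t & L_y L_t & L_t^2
\end{pmatrix}

H = \det(\mu) - k\,\operatorname{trace}^3(\mu)
  = \lambda_1 \lambda_2 \lambda_3 - k\,(\lambda_1 + \lambda_2 + \lambda_3)^3
```

Interest points are taken at local maxima of H, that is, where the signal varies strongly in both space and time.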
Representing Motion
Space-Time Volumes
Today's lecture
Introduction application
Features
Feature classification methods
Recognition using pose estimation and
objects
Part-based action localisation
Training
Labels
Classifier training
Trained classifier
Testing
Test image
Image features
Trained classifier
Prediction: Outdoor
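The train/test pipeline can be sketched with the simplest possible classifier. The example below assumes features have already been extracted as fixed-length vectors and uses a 1-nearest-neighbour rule; the function names and the toy feature values are hypothetical.

```python
import numpy as np

def train_nearest_neighbor(features, labels):
    """'Training' a 1-NN classifier only stores the labelled feature vectors."""
    return np.asarray(features, dtype=float), list(labels)

def predict_nearest_neighbor(model, test_feature):
    """Return the label of the stored feature closest to the test feature."""
    train_features, train_labels = model
    query = np.asarray(test_feature, dtype=float)
    dists = np.linalg.norm(train_features - query, axis=1)
    return train_labels[int(np.argmin(dists))]

# Toy 2-D features standing in for real image descriptors.
model = train_nearest_neighbor([[0.0, 0.0], [10.0, 10.0]], ["indoor", "outdoor"])
prediction = predict_nearest_neighbor(model, [9.0, 8.0])
```

Any other classifier fits the same two-step interface: a training call that consumes features and labels, and a prediction call that consumes test features.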
Spatial pyramids.
Corner detectors in space-time
Classification
Boosted stumps for pyramids of optical flow, gradient
Nearest neighbor for STIP
Without keyframe detection
Talk on phone
Approach
Space-time interest point detectors
Descriptors
HOG, HOF
Spatio-Temporal Binning
Interest Points
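Spatio-temporal binning can be illustrated as follows: the video volume is divided into a grid of cells, and a separate histogram of quantised descriptors is accumulated per cell. This is a minimal sketch assuming the HOG/HOF descriptors have already been quantised into codeword indices; the function name, grid size and vocabulary size are hypothetical.

```python
import numpy as np

def spatiotemporal_bow(points, words, video_shape, grid=(2, 2, 2), vocab_size=4):
    """Concatenate per-cell codeword histograms over a space-time grid.

    points: (N, 3) interest point positions (x, y, t)
    words: (N,) codeword index of each point's descriptor
    video_shape: (width, height, n_frames)
    """
    points = np.asarray(points, dtype=float)
    words = np.asarray(words, dtype=int)
    g = np.asarray(grid)
    # Map each point to its grid cell; clip points lying on the far border.
    cell = np.minimum((points / np.asarray(video_shape, dtype=float) * g).astype(int), g - 1)
    cell_index = np.ravel_multi_index(cell.T, grid)
    hist = np.zeros((int(np.prod(grid)), vocab_size))
    np.add.at(hist, (cell_index, words), 1)
    hist = hist.ravel()
    return hist / max(hist.sum(), 1)             # L1 normalisation

h = spatiotemporal_bow([[0, 0, 0], [9, 9, 9]], [0, 3], video_shape=(10, 10, 10))
```

With a 1x1x1 grid this reduces to a plain bag-of-words histogram; finer grids keep coarse layout information.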
Results
Today's lecture
Introduction application
Features
Template classification methods
Recognition using pose estimation and
objects
Part-based action localisation
Human-Object Interaction
Holistic image-based classification
Integrated reasoning
Human pose estimation
Object detection
Action categorization
[Figure labels: head, torso, tennis racket]
Human pose estimation is challenging.
Difficult part appearance
Self-occlusion
Human pose estimation is challenging.
Small, low-resolution, partially occluded
Human pose and object model (figure labels)
O: object (tennis racket, croquet mallet, volleyball)
H: human pose
Intra-class variations: more than one H for each A; H is unobserved during training.
P: body parts P1, P2, ..., PN
f: image evidence fO, f1, f2, ..., fN
[Figure: classification accuracy; example shown: tennis forehand]
Our model: 83.3%
Gupta et al., 2009: 78.9%
Bag-of-words (SIFT+SVM): 52.5%
Today's lecture
Introduction application
Features
Template classification methods
Recognition using pose estimation and
objects
Part-based action localisation
Part-based Recognition & Localisation
Implicit shape model
Goal:
Recognize categories of actions
Localize them in terms of their bounding box (space + time)
Challenges:
Occlusions, clutter, variations
Hypothesis:
Analysis can be restricted to a set of
[Figure: plot annotated with H_D = 7.201]
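Spatiotemporal Saliency:
Driven by signal unpredictability in a spatiotemporal volume (cylinder / sphere)
Examine entropy:
$$Y(v_k) = w(v_k)\, H(v_k)$$
Entropy's height: $H(v_k)$
Entropy's peakedness:
$$w(s, d, u) = s \int_q \left| \frac{\partial}{\partial s}\, p_D(s, d, u) \right| dq + d \int_q \left| \frac{\partial}{\partial d}\, p_D(s, d, u) \right| dq$$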
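The entropy term can be sketched numerically as the Shannon entropy of the signal histogram inside a cylinder around a candidate point. The example below is an illustration under assumptions: the signal is a video volume with values in [0, 1], the cylinder has spatial radius `radius` and temporal half-depth `depth`, and the function name and bin count are hypothetical.

```python
import numpy as np

def local_entropy(volume, center, radius, depth, bins=16):
    """Shannon entropy (bits) of the value histogram in a space-time cylinder.

    volume: (T, H, W) array with values in [0, 1]
    center: (t, y, x) position of the cylinder axis midpoint
    """
    t0, y0, x0 = center
    t, y, x = np.ogrid[:volume.shape[0], :volume.shape[1], :volume.shape[2]]
    mask = ((y - y0) ** 2 + (x - x0) ** 2 <= radius ** 2) & (np.abs(t - t0) <= depth)
    hist, _ = np.histogram(volume[mask], bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

# Toy volume: even frames are all 0, odd frames are all 1.
video = np.zeros((5, 9, 9))
video[1::2] = 1.0
e = local_entropy(video, center=(2, 4, 4), radius=2, depth=1)
```

A constant region gives zero entropy, while a region mixing several values gives a high entropy; the saliency measure additionally weights this by how sharply the distribution changes with scale.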
Input sequence
Optical flow
Optical flow after median subtraction
Spatiotemporal salient point detection
Codebook (class-specific): c1, c2, ..., cN
Feature selection
Ensemble codewords
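The median subtraction step in the pipeline above can be sketched as follows. This is a minimal illustration assuming a dense flow field per frame and assuming that the per-frame median flow vector approximates global camera motion; the function name is hypothetical.

```python
import numpy as np

def subtract_median_flow(flow):
    """Subtract the per-frame median flow vector from every pixel.

    flow: (H, W, 2) dense optical flow. The median over all pixels is used
    here as a crude estimate of global (camera) motion.
    """
    median = np.median(flow.reshape(-1, 2), axis=0)
    return flow - median

# Background moves uniformly by (2, -1); one pixel moves differently.
flow = np.tile(np.array([2.0, -1.0]), (4, 4, 1))
flow[1, 1] = [5.0, 3.0]
residual = subtract_median_flow(flow)
```

After subtraction, the background flow is close to zero and only independently moving regions keep a significant magnitude.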
Class-dependent spatio-temporal probabilistic voting
Parameters stored for each ensemble e_d in the training set:
X: average spatial position of the ensemble with respect to the subject center and lower bound.
T: distance in frames of the activated ensemble from the start/end of the action.
S: average spatiotemporal scale of the ensemble.
$$p(\theta \mid c_i) = w_i \sum_d p(e_d \mid c_i)\, p(\theta \mid e_d)$$
[Figure: codeword c_i and ensemble e_d activated at the current frame t; timeline labels -t and T-t]
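The temporal part of the voting can be sketched as a weighted accumulator over frames. The example below is a simplified illustration, assuming each activation is summarised by the current frame, the stored distance from the action start, and a weight; spatial position and scale are omitted, and the function name is hypothetical.

```python
import numpy as np

def vote_action_start(activations, n_frames):
    """Accumulate weighted votes for the frame at which the action starts.

    activations: iterable of (frame t, stored offset from action start, weight).
    Each activation votes for start frame t - offset.
    """
    accumulator = np.zeros(n_frames)
    for t, offset_from_start, weight in activations:
        start = t - offset_from_start
        if 0 <= start < n_frames:                # ignore votes outside the sequence
            accumulator[start] += weight
    return accumulator

votes = vote_action_start([(10, 4, 0.5), (12, 6, 0.25), (20, 3, 0.125)], n_frames=30)
estimated_start = int(np.argmax(votes))
```

Peaks in the accumulator correspond to hypotheses supported by several consistent activations, which is what makes this kind of voting tolerant to partial occlusion.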
Discriminative learning
Higher weights for pdfs with higher localisation accuracy
[Equation: weight w_i, an exponential of an expression in p(· | c_i) and log p(· | c_i)]
Kernel: $K(F, F')$
$$c_l(F; w) = w_0 + \sum_j w_j K_l(F, F_j)$$
$$\frac{1}{1 + e^{-c_l(F; w)}}$$
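The kernel classifier can be sketched as a weighted sum of kernel evaluations against training examples, passed through a logistic function. The kernel itself is not specified on the slide, so the RBF kernel below is an assumption, as are the function names and the toy weights.

```python
import numpy as np

def rbf_kernel(f, f_prime, gamma=1.0):
    """Assumed kernel for this sketch: K(F, F') = exp(-gamma * ||F - F'||^2)."""
    diff = np.asarray(f, dtype=float) - np.asarray(f_prime, dtype=float)
    return float(np.exp(-gamma * (diff @ diff)))

def classify(f, train_features, weights, w0, kernel=rbf_kernel):
    """Score c(F; w) = w0 + sum_j w_j K(F, F_j), mapped to (0, 1) by a sigmoid."""
    score = w0 + sum(w_j * kernel(f, f_j) for w_j, f_j in zip(weights, train_features))
    return 1.0 / (1.0 + np.exp(-score))

train = [[0.0, 0.0], [3.0, 3.0]]
p_near_first = classify([0.0, 0.0], train, weights=[2.0, -2.0], w0=0.0)
p_near_second = classify([3.0, 3.0], train, weights=[2.0, -2.0], w0=0.0)
```

Training examples with positive weights pull nearby inputs towards the class, and those with negative weights push them away.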
Action recognition
Summary
Advantages
Fast recognition
Robustness to occlusions
Disadvantages
Take-home messages
Action recognition is an open problem.
Some work has been done, but it is just the beginning of exploring the problem. So far:
Actions are mainly categorical
Most approaches are classification using simple features (spatio-temporal histograms of gradients or flow, space-time interest points, SIFT in images)
Just a couple of works on how to incorporate pose and objects
Not much idea of how to reason about long-term activities or to describe video sequences
To come
Recognition in image sequences
Action recognition
(body gestures / facial expressions)
Tracking
Structure from motion
Surveillance
References
Ivan Laptev, Patrick Pérez, Retrieving actions in movies, in:
Proceedings of the ICCV 07, Rio de Janeiro, Brazil, October 2007,
pp. 1-8.
Ivan Laptev, Marcin Marszałek, Cordelia Schmid, Benjamin
Rozenfeld, Learning realistic human actions from movies, in:
Proceedings of the CVPR 08, Anchorage, AK, June 2008, pp. 1-8.
Modeling Mutual Context of Object and Human Pose in Human-Object
Interaction Activities Bangpeng Yao and Li Fei-Fei IEEE CVPR 10. San
Francisco, CA, USA. June 13-18, 2010.
O. Boiman, M. Irani, Detecting irregularities in images and in video, in:
ICCV, Beijing, China, Oct 15-21, 2005.
P. Felzenszwalb, D. Huttenlocher Pictorial Structures for Object
Recognition, IJCV Vol. 61, No. 1, January 2005
Deva Ramanan, Learning to parse images of articulated bodies, in:
NIPS, 19, Vancouver, Canada, December 2006
V. Ferrari, M. Marin, and A. Zisserman, "Progressive Search Space
Reduction for Human Pose Estimation", IEEE CVPR, Alaska, June 2008.
Yang Wang and Greg Mori, Multiple Tree Models for Occlusion and
Spatial Constraints in Human Pose Estimation, ECCV, 2008
M Andriluka, S Roth, B Schiele, Pictorial Structures Revisited: People
Detection and Articulated Pose Estimation In: IEEE CVPR, 2009
M. Eichner and V. Ferrari "Better Appearance Models for Pictorial
Structures", BMVC, London, September 2009.
Paul A. Viola and Michael J. Jones. Rapid object detection using a
boosted cascade of simple features. In CVPR (1), 2001.
Matthew B. Blaschko, Christoph H. Lampert, "Learning to Localize
Objects with Structured Output Regression", ECCV, Marseille, France, 2008
A. Oikonomopoulos, I. Patras and M. Pantic, "Spatiotemporal
Localization and Categorization of Human Actions in
Unsegmented Image Sequences" . IEEE Trans. Image Processing,
vol. 20, no. 4, pp. 1126-1140, Mar. 2011
Multiple Kernels for Object Detection, A. Vedaldi, V. Gulshan, M.
Varma, and A. Zisserman, in Proceedings of the ICCV, 2009