Lecture On Action Recognition

Action recognition
ECS734 Techniques in Computer Vision

Ioannis Patras
i.patras@ecs.qmul.ac.uk
Slides thanks to Hays, Hoiem, Grauman, Oikonomopoulos
Past lectures
Recognition in static images
Object recognition
Image categorisation
Today's lectures
Recognition in image sequences
Action recognition
(body gestures / facial expressions)
To come
Action recognition
Tracking
Structure from motion
Surveillance
Today's lecture
Introduction and applications
Features
Template classification methods
Recognition using pose estimation and objects (in brief)
Part-based action localisation
What is an action?
Action: a transition from one state to another
Who is the actor?

How is the state of the actor changing?
What (if anything) is being acted on?
How is that thing changing?
What is the purpose of the action (if any)?
Human activity in video

No universal terminology, but approximately:
Actions: atomic motion patterns, often gesture-like, with a single clear-cut trajectory and a single nameable behavior (e.g., break eggs, lift spoon, kick, wave arms)
Activity: series or composition of actions (e.g., people having a conversation, interacting)
Event: combination of activities or actions (e.g., a football game, a traffic accident, cooking a meal)
Adapted from Venu Govindaraju
How do we represent actions?

Categories
Walking, hammering, dancing, skiing, sitting
down, standing up, jumping
Poses
Nouns and Predicates
<man, swings, hammer>
<man, hits, nail, w/ hammer>
Applications
Human-computer interfaces
Augmented Reality
Sports Analysis

[ C. Sminchisescu, 2007 ]
Surveillance
http://users.isr.ist.utl.pt/~etienne/mypubs/Auvinetal06PETS.pdf
Interfaces
[Figures: interface examples from 2011 and from 1995]
W. T. Freeman and C. Weissman, Television control by hand gestures, International Workshop on Automatic Face- and Gesture-Recognition, IEEE Computer Society, Zurich, Switzerland, June 1995, pp. 179-183. MERL-TR94-24
How can we identify actions?
Motion
Held objects
Pose
Nearby objects
Representing Motion
Optical Flow with Motion History
Bobick Davis 2001
Representing Motion
Optical Flow with Split Channels
Efros et al. 2003
Representing Motion
Tracked Points
Matikainen et al. 2009
Representing Motion
Space-Time Interest Points
Corner detectors in space-time
Laptev 2005
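A space-time corner detector can be sketched as a Harris-style measure on the 3x3 second-moment matrix of the video gradients (x, y, t). The code below is a rough illustration under stated assumptions: a box filter stands in for Gaussian scale-space smoothing, the constant `k` is a tunable placeholder, and the function names are hypothetical.

```python
import numpy as np

def box_smooth(volume, size=3):
    """Separable box filter along every axis (assumed stand-in for Gaussian smoothing)."""
    kernel = np.ones(size) / size
    out = volume.astype(float)
    for axis in range(out.ndim):
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, out)
    return out

def spacetime_corner_response(video, k=0.005, size=3):
    """video: (T, H, W). Returns det(mu) - k * trace(mu)**3 for every voxel."""
    it, iy, ix = np.gradient(video.astype(float))
    # Entries of the smoothed symmetric second-moment matrix mu.
    xx, yy, tt = (box_smooth(g, size) for g in (ix * ix, iy * iy, it * it))
    xy, xt, yt = (box_smooth(g, size) for g in (ix * iy, ix * it, iy * it))
    det = (xx * (yy * tt - yt * yt)
           - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    trace = xx + yy + tt
    return det - k * trace ** 3

# A static clip (same textured frame repeated) has no temporal gradient,
# so no voxel should get a positive space-time corner response.
rng = np.random.default_rng(0)
static_clip = np.tile(rng.random((8, 8)), (6, 1, 1))
static_response = spacetime_corner_response(static_clip)
```

Interest points would then be taken as local maxima of the response; spatial corners that do not change over time are rejected because the determinant vanishes when the temporal gradient is zero.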
Representing Motion
Space-Time Volumes
Blank et al. 2005
Examples of Action Recognition Systems
Feature-based classification
Recognition using pose and objects
Part-based recognition and localisation
Action recognition as classification
Retrieving actions in movies, Laptev and Perez, 2007
Remember image categorization
Training: training images -> image features; image features + training labels -> classifier training -> trained classifier
Testing: test image -> image features -> trained classifier -> prediction (e.g., "Outdoor")
Spatial pyramids.
Compute histogram in each spatial bin
Features for Classifying Actions

1. Spatio-temporal pyramids (14x14x8 bins)
Image Gradients
Optical Flow
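The spatio-temporal pyramid idea can be sketched as follows: split the video volume into a grid of space-time cells, histogram the quantised features (for example gradient or flow orientation bins) in each cell, and concatenate. This is a minimal single-level sketch; `grid_histograms`, `grid` and `n_words` are hypothetical names, and the toy grid is much smaller than the 14x14x8 layout on the slide.

```python
import numpy as np

def grid_histograms(labels, grid=(2, 3, 3), n_words=4):
    """labels: (T, H, W) integer array of quantised features in [0, n_words).

    Returns the concatenation of one histogram per space-time cell,
    with grid = (cells in t, cells in y, cells in x).
    """
    t_edges, y_edges, x_edges = (
        np.linspace(0, n, g + 1).astype(int) for n, g in zip(labels.shape, grid))
    hists = []
    for t0, t1 in zip(t_edges[:-1], t_edges[1:]):
        for y0, y1 in zip(y_edges[:-1], y_edges[1:]):
            for x0, x1 in zip(x_edges[:-1], x_edges[1:]):
                cell = labels[t0:t1, y0:y1, x0:x1]
                hists.append(np.bincount(cell.ravel(), minlength=n_words))
    return np.concatenate(hists)

# Toy volume of 4 frames, 6x6 pixels, with 4 possible feature labels.
labels = (np.arange(4 * 6 * 6) % 4).reshape(4, 6, 6)
pyramid_feature = grid_histograms(labels, grid=(2, 3, 3), n_words=4)
```

A full pyramid would repeat this at several grid resolutions and concatenate the results; separate label volumes would be used for image gradients and optical flow.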
Features for Classifying Actions

2. Spatio-temporal interest points
Corner detectors in space-time
Descriptors based on Gaussian derivative filters over x, y, time
Classification
Boosted stumps for pyramids of optical flow and gradient
Nearest neighbor for space-time interest points (STIP)
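Boosting with very simple weak learners can be sketched with discrete AdaBoost over decision stumps (one-feature threshold classifiers), which is the usual reading of the boosted weak learners on this slide. The code is an illustrative toy, not the classifier from the cited paper; all function names are hypothetical.

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    """Predict +sign where feature j >= thr, -sign elsewhere."""
    return np.where(X[:, j] >= thr, sign, -sign)

def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error. y is in {-1, +1}."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                err = w[stump_predict(X, j, thr, sign) != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, rounds=5):
    """Return a list of (alpha, feature, threshold, sign) weak learners."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(rounds):
        err, j, thr, sign = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)      # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * stump_predict(X, j, thr, sign))
        w /= w.sum()                               # re-normalise sample weights
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in ensemble)
    return np.sign(score)

# Toy data: only the second feature separates the two classes.
X = np.array([[0.5, 0.1], [0.4, 0.2], [0.5, 0.8], [0.4, 0.9]])
y = np.array([-1, -1, 1, 1])
ensemble = adaboost(X, y, rounds=3)
```

In the action-recognition setting each feature would be one bin of a spatio-temporal pyramid histogram, so the selected stumps also indicate which bins are discriminative.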
Searching the video for an action
1. Detect keyframes using a trained HOG detector in each frame
2. Classify detected keyframes as positive (e.g., drinking) or negative (other)
Accuracy in searching video
[Figure: retrieval accuracy with keyframe detection vs. without keyframe detection]
Talk on phone
Get out of car
Learning realistic human actions from movies, Laptev et al. 2008
Approach
Space-time interest point detectors
Descriptors: HOG (histograms of oriented gradients), HOF (histograms of optical flow)
Pyramid histograms (3x3x2)
SVMs with chi-squared kernel
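A chi-squared kernel between histogram features can be sketched as below. The exponential form and the choice of the mean pairwise distance as the scale are assumptions made for this illustration; `chi2_distance`, `chi2_kernel`, `scale` and `eps` are hypothetical names.

```python
import numpy as np

def chi2_distance(X, Y, eps=1e-12):
    """Pairwise chi-squared distances between rows of X and Y (non-negative histograms)."""
    diff = X[:, None, :] - Y[None, :, :]
    total = X[:, None, :] + Y[None, :, :]
    return 0.5 * np.sum(diff ** 2 / (total + eps), axis=-1)

def chi2_kernel(X, Y, scale=None):
    """K = exp(-D / scale); scale defaults to the mean distance (an assumption)."""
    D = chi2_distance(X, Y)
    if scale is None:
        scale = D.mean() if D.mean() > 0 else 1.0
    return np.exp(-D / scale)

# Three toy 2-bin histograms.
H = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
K = chi2_kernel(H, H)
```

The resulting matrix `K` could be passed to any SVM implementation that accepts a precomputed kernel.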
Spatio-Temporal Binning
Interest Points
Results
Action Recognition using Pose and Objects
Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities, B. Yao and Li Fei-Fei, 2010
Slide Credit: Yao/Fei-Fei
Human-Object Interaction
Holistic image-based classification
Integrated reasoning:
Human pose estimation (head, torso)
Object detection (tennis racket)
Action categorization (HOI activity: tennis forehand)
Human pose estimation & Object detection
Human pose estimation is challenging:
Difficult part appearance
Self-occlusion
Image region looks like a body part
Felzenszwalb & Huttenlocher, 2005; Ren et al, 2005; Ramanan, 2006; Ferrari et al, 2008; Wang & Mori, 2008; Andriluka et al, 2009; Eichner & Ferrari, 2009
Pose estimation is facilitated given that the object is detected.
Object detection is challenging:
Small, low-resolution, partially occluded
Image region similar to detection target
Viola & Jones, 2001; Lampert et al, 2008; Divvala et al, 2009; Vedaldi et al, 2009
Object detection is facilitated given that the pose is estimated.
Mutual Context
Mutual Context Model Representation
A: activity (e.g., tennis forehand, croquet shot, volleyball smash)
O: object (e.g., tennis racket, croquet mallet, volleyball)
H: human pose. Intra-class variations: more than one H for each A; H is unobserved during training.
P: body parts $P_1, P_2, \ldots, P_N$, each with $l^P$: location; $\theta^P$: orientation; $s^P$: scale.
f: image evidence ($f_1, f_2, \ldots, f_N$ for the body parts, $f_O$ for the object), described with shape context [Belongie et al, 2002]
Activity Classification Results
[Figure: classification accuracy, with example activities such as cricket shot and tennis forehand]
Our model: 83.3%
Gupta et al, 2009: 78.9%
Bag-of-words (SIFT+SVM): 52.5%
Part-based Recognition & Localisation
Implicit shape model
Goal:
Recognize categories of actions
Localize them in terms of their bounding box (space + time)
Challenges:
Occlusions, clutter, variations
Hypothesis:
Analysis can be restricted to a set of spatiotemporally interesting/salient events
Information theoretical spatial saliency
Proposal: use signal unpredictability as an indicator of saliency
Spatial saliency: unpredictability in a single frame
[Figure: two example image regions with entropy $H_D = 3.866$ and $H_D = 7.201$]
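Signal unpredictability in a region can be measured by the Shannon entropy of its grey-level histogram: a flat region is fully predictable (entropy 0), while a region using many grey levels equally often is maximally unpredictable. A minimal sketch, assuming 8-bit grey levels; `patch_entropy` and `n_bins` are hypothetical names.

```python
import numpy as np

def patch_entropy(patch, n_bins=256):
    """Shannon entropy in bits of the grey-level histogram of a patch (values in [0, 256))."""
    hist, _ = np.histogram(patch, bins=n_bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()          # empirical pdf over occupied bins
    return float((p * np.log2(1.0 / p)).sum())

flat_patch = np.full((16, 16), 128)              # one grey level only
varied_patch = np.arange(256).reshape(16, 16)    # every grey level exactly once
flat_entropy = patch_entropy(flat_patch)
varied_entropy = patch_entropy(varied_patch)
```

With 256 bins the entropy ranges from 0 to 8 bits, which is consistent with the scale of the example values on the slide.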
Towards scale invariance
The entropy maxima reveal the spatial scale(s) of a salient region
[Figure: entropy as a function of scale (circle radius); detected salient points in a single frame]
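Spatial and spatiotemporal saliency
Spatiotemporal saliency: driven by signal unpredictability in a spatiotemporal volume (cylinder / sphere)
Examine the entropy $H_D$: the saliency at a candidate $v_k$ is $Y(v_k) = w(v_k)\, H(v_k)$
$H$: the entropy's height
$w$: the entropy's peakedness (s: spatial scale, d: temporal scale),
$$w(s, d, u) = s \int_q \left| \frac{\partial}{\partial s}\, p_D(s, d, u) \right| dq + d \int_q \left| \frac{\partial}{\partial d}\, p_D(s, d, u) \right| dq$$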
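The idea of weighting the entropy's height by its peakedness over scale can be sketched in a simplified, purely spatial form: compute the grey-level pdf inside discs of growing radius around a point, take the entropy at each radius, and weight it by how fast the pdf changes with scale. This is an illustration under assumptions (single spatial scale only, finite differences for the scale derivative, hypothetical function names), not the spatiotemporal detector from the lecture.

```python
import numpy as np

def disc_pdf(image, cy, cx, radius, n_bins=16):
    """Grey-level pdf inside a disc of the given radius centred at (cy, cx)."""
    yy, xx = np.ogrid[:image.shape[0], :image.shape[1]]
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    hist, _ = np.histogram(image[mask], bins=n_bins, range=(0, 256))
    return hist / hist.sum()

def scale_saliency(image, cy, cx, radii, n_bins=16):
    """Return (H, W, Y): entropy, peakedness weight and saliency Y = W * H per radius."""
    radii = np.asarray(radii, dtype=float)
    pdfs = np.array([disc_pdf(image, cy, cx, r, n_bins) for r in radii])
    safe = np.where(pdfs > 0, pdfs, 1.0)              # log2(1) = 0 for empty bins
    H = (pdfs * np.log2(1.0 / safe)).sum(axis=1)
    # Finite-difference estimate of s * sum_q |dp/ds| (zero at the first radius).
    W = np.zeros(len(radii))
    W[1:] = radii[1:] * np.abs(np.diff(pdfs, axis=0)).sum(axis=1) / np.diff(radii)
    return H, W, H * W

def entropy_peaks(H):
    """Indices of local maxima of the entropy-versus-scale curve."""
    return [i for i in range(1, len(H) - 1) if H[i] > H[i - 1] and H[i] >= H[i + 1]]

flat_image = np.full((21, 21), 100.0)
H_flat, W_flat, Y_flat = scale_saliency(flat_image, 10, 10, [2, 4, 6, 8])
```

Salient scales would be read off at the indices returned by `entropy_peaks`, and points ranked by the saliency `Y` at those scales.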
Descriptor extraction and codebook creation
Input sequence -> optical flow -> optical flow after median subtraction
Spatiotemporal salient point detection
Descriptors: optical flow + spatial gradient. Bin in histograms and concatenate.
Feature ensembles: O. Boiman & M. Irani [ICCV05]
Feature selection -> ensemble codewords $c_1, c_2, \ldots, c_N$ -> codebook (class-specific)
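Binning optical flow and spatial gradients into histograms, concatenating them, and matching the result to a codebook can be sketched as follows. The orientation binning, the magnitude weighting and the names `orientation_histogram`, `describe` and `assign_codewords` are assumptions made for illustration only.

```python
import numpy as np

def orientation_histogram(dx, dy, n_bins=8):
    """Magnitude-weighted histogram of vector orientations, normalised to sum to 1."""
    angle = np.arctan2(dy, dx)
    magnitude = np.hypot(dx, dy)
    hist, _ = np.histogram(angle, bins=n_bins, range=(-np.pi, np.pi), weights=magnitude)
    total = hist.sum()
    return hist / total if total > 0 else hist

def describe(flow_x, flow_y, grad_x, grad_y, n_bins=8):
    """Concatenate the flow histogram and the spatial-gradient histogram."""
    return np.concatenate([orientation_histogram(flow_x, flow_y, n_bins),
                           orientation_histogram(grad_x, grad_y, n_bins)])

def assign_codewords(descriptors, codebook):
    """Index of the nearest codeword (Euclidean distance) for every descriptor."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# Toy patch: uniform flow towards the right and slightly up,
# uniform gradient pointing up and to the left.
ones = np.ones((4, 4))
descriptor = describe(ones, 0.5 * ones, -0.5 * ones, ones)

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_descriptors = np.array([[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]])
codeword_index = assign_codewords(toy_descriptors, codebook)
```

In practice the codebook would be learned per class (for example by clustering training descriptors), which is not shown here.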
Class-dependent spatio-temporal probabilistic voting
Parameters stored for each ensemble $e_d$ in the training set:
X: average spatial position of the ensemble with respect to the subject center and lower bound
T: distance in frames of the activated ensemble from the start/end of the action
S: average spatiotemporal scale of the ensemble
Localisation model learned for codeword/cluster $c_i$, with $\theta = (X, T, S)$:
$$p(\theta \mid c_i) = w_i \sum_d p(e_d \mid c_i)\, p(\theta \mid e_d)$$
[Figure: timeline showing the current frame $t$, the activated ensemble $e_d$ and its codeword $c_i$, with offsets $-t$ and $T - t$ to the start and end of the action]
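The voting step can be sketched as a weighted accumulator: every activated codeword casts a vote for the action's location using the offsets stored during training, and agreeing votes build up a peak. This toy version votes only for a horizontal centre and a start frame, stores a single offset per codeword, and takes the accumulator maximum, whereas the lecture's method uses pdfs over position, time and scale and a mean-shift mode search. All names and data structures are hypothetical.

```python
import numpy as np

def cast_votes(activations, models, shape):
    """Accumulate weighted votes for (centre x, start frame).

    activations: list of (codeword index, x, t, match probability)
    models: dict codeword index -> (weight w_i, offset in x, offset in t)
    shape: (number of x positions, number of frames) of the vote accumulator
    """
    votes = np.zeros(shape)
    for codeword, x, t, prob in activations:
        weight, dx, dt = models[codeword]
        vx, vt = x - dx, t - dt                 # hypothesised centre and start frame
        if 0 <= vx < shape[0] and 0 <= vt < shape[1]:
            votes[vx, vt] += weight * prob
    return votes

# Two codewords agree on (x=10, start=5); a third activation is an outlier.
models = {0: (1.0, 2, 3), 1: (0.5, -1, 5)}
activations = [(0, 12, 8, 0.9), (1, 9, 10, 0.8), (0, 30, 20, 0.2)]
votes = cast_votes(activations, models, shape=(40, 30))
peak = tuple(int(i) for i in np.unravel_index(votes.argmax(), votes.shape))
```

Because each part votes independently, a hypothesis can still gather enough support when some parts are occluded.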
Discriminative learning
Higher weights for pdfs with higher localisation accuracy ($\theta$: the voted localisation parameters):
$$w_i = \exp\left( \int p(\theta \mid c_i)\, \log p(\theta \mid c_i)\, d\theta \right)$$
Class dictionary comprises discriminative codewords
AdaBoost on the codeword similarities
Hypothesis verification with RVM-based classification
The Relevance Vector Machine (RVM) is a sparse Bayesian variant of the Support Vector Machine
Mean-shift responses $F = \{f_1, \ldots, f_i, \ldots\}$ are used as features in RVM-based classification
Kernel $K(F, F')$ defined from the distance $D_C(F, F')$
Two-class classification problem for each class $l$:
$$c_l(F; w) = w_0 + \sum_{j=1}^{N} w_j K_l(F, F_j)$$
Select the class $l$ that maximizes the posterior probability
$$p(l \mid F) = \frac{1}{1 + e^{-c_l(F; w)}}$$
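The verification step, a kernel expansion per class followed by a logistic posterior and a maximum over classes, can be sketched as below. Training of the RVM weights is not shown, and the RBF kernel is only a stand-in assumption for the lecture's kernel; the names `class_score`, `posterior`, `classify` and the toy models are hypothetical.

```python
import numpy as np

def rbf_kernel(F, F_j, gamma=1.0):
    """Stand-in kernel (assumption); the lecture's kernel is built on D_C instead."""
    return float(np.exp(-gamma * np.sum((np.asarray(F) - np.asarray(F_j)) ** 2)))

def class_score(F, relevance_vectors, weights, bias, kernel=rbf_kernel):
    """c_l(F; w) = w_0 + sum_j w_j K_l(F, F_j)."""
    return bias + sum(w * kernel(F, F_j) for w, F_j in zip(weights, relevance_vectors))

def posterior(score):
    """Logistic sigmoid mapping a score to p(l | F)."""
    return 1.0 / (1.0 + np.exp(-score))

def classify(F, models):
    """models: dict label -> (relevance_vectors, weights, bias). Returns (label, posteriors)."""
    post = {label: posterior(class_score(F, *model)) for label, model in models.items()}
    return max(post, key=post.get), post

# Toy two-class setup with one relevance vector per class.
models = {
    "walk": ([[0.0, 0.0]], [2.0], -1.0),
    "run": ([[3.0, 3.0]], [2.0], -1.0),
}
label, post = classify(np.array([0.2, 0.1]), models)
```

Each class is treated as its own two-class problem, so the posteriors need not sum to one; the hypothesis is assigned to the class with the largest posterior.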
Localisation of single actions
Localisation accuracy (KTH)
Action recognition
KTH dataset average: 88%
HoHA dataset average: 37%
Localisation under artificial occlusions (KTH)
Localisation under clutter (KTH)
Summary
Advantages
Highly flexible structure model
Each part casts votes independently
Only few training examples are needed
Fast recognition
Robustness to occlusions
Disadvantages
Loose spatial model that does not model co-occurrence of parts.
False positives in background (clutter)
Take-home messages
Action recognition is an open problem.
How to define actions?

How to infer them?
What are good visual cues?
How do we incorporate higher level reasoning?
Take-home messages
Some work has been done, but it is just the beginning of exploring the problem. So far:
Actions are mainly categorical
Most approaches are classification using simple features (spatio-temporal histograms of gradients or flow, space-time interest points, SIFT in images)
Only a couple of works address how to incorporate pose and objects
There is little understanding of how to reason about long-term activities or to describe video sequences
To come
Action recognition
Tracking
Structure from motion
Surveillance
References
C. Sminchisescu. Learning and Inference Algorithms for Monocular Perception - Applications to Visual Object Detection, Localization and Time Series Models for 3D Human Motion Understanding. Habilitation Thesis, University of Bonn, Faculty of Mathematics and Natural Sciences, 2007.
A. Bobick and J. Davis. The recognition of human movement using temporal templates. IEEE Trans. PAMI, vol. 23, pp. 257-267, Mar. 2001.
A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In Proceedings of IEEE ICCV '03, Volume 2, 2003.
P. Matikainen, M. Hebert, and R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Workshop on Video-Oriented Object and Event Classification, ICCV 2009.
I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3), 2005.
M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, Beijing, China, Oct. 15-21, 2005.
I. Laptev and P. Pérez. Retrieving actions in movies. In Proceedings of ICCV '07, Rio de Janeiro, Brazil, October 2007, pp. 1-8.
I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proceedings of CVPR '08, Anchorage, AK, June 2008, pp. 1-8.
B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. IEEE CVPR '10, San Francisco, CA, USA, June 13-18, 2010.
O. Boiman and M. Irani. Detecting irregularities in images and in video. In ICCV, Beijing, China, Oct. 15-21, 2005.
P. Felzenszwalb and D. Huttenlocher. Pictorial Structures for Object Recognition. IJCV, vol. 61, no. 1, January 2005.
D. Ramanan. Learning to parse images of articulated bodies. In NIPS 19, Vancouver, Canada, December 2006.
V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive Search Space Reduction for Human Pose Estimation. IEEE CVPR, Alaska, June 2008.
Y. Wang and G. Mori. Multiple Tree Models for Occlusion and Spatial Constraints in Human Pose Estimation. ECCV, 2008.
M. Andriluka, S. Roth, and B. Schiele. Pictorial Structures Revisited: People Detection and Articulated Pose Estimation. In IEEE CVPR, 2009.
M. Eichner and V. Ferrari. Better Appearance Models for Pictorial Structures. BMVC, London, September 2009.
P. A. Viola and M. J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR (1), 2001.
M. B. Blaschko and C. H. Lampert. Learning to Localize Objects with Structured Output Regression. ECCV, Marseille, France, 2008.
A. Oikonomopoulos, I. Patras, and M. Pantic. Spatiotemporal Localization and Categorization of Human Actions in Unsegmented Image Sequences. IEEE Trans. Image Processing, vol. 20, no. 4, pp. 1126-1140, Mar. 2011.
A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple Kernels for Object Detection. In Proceedings of ICCV, 2009.
