
Action recognition

ECS734 Techniques in Computer Vision


Ioannis Patras
i.patras@ecs.qmul.ac.uk

Slides thanks to Hays, Hoiem, Grauman, Oikonomopoulos

Past lectures
Recognition in static images
Object recognition
Image categorisation

Today's lectures
Recognition in image sequences
Action recognition
(body gestures / facial expressions)

To come
Recognition in image sequences
Action recognition
(body gestures / facial expressions)
Tracking
Structure from motion
Surveillance

Today's lecture
Introduction and applications
Features
Template classification methods
Recognition using pose estimation and
objects (in brief)
Part-based action localisation

What is an action?

Action: a transition from one state to another

Who is the actor?


How is the state of the actor changing?
What (if anything) is being acted on?
How is that thing changing?
What is the purpose of the action (if any)?

Human activity in video


No universal terminology, but approximately:
Actions: atomic motion patterns -- often gesture-like, single clear-cut trajectory, single nameable behavior (e.g., break eggs, lift spoon, kick, wave arms)
Activity: series or composition of actions (e.g.,
people having a conversation, interacting)
Event: combination of activities or actions (e.g., a
football game, a traffic accident, cooking a meal)
Adapted from Venu Govindaraju

How do we represent actions?


Categories
Walking, hammering, dancing, skiing, sitting
down, standing up, jumping

Poses
Nouns and Predicates
<man, swings, hammer>
<man, hits, nail, w/ hammer>

Applications
Human-Computer
interfaces
Augmented Reality
Sports Analysis

[ C. Sminchisescu, 2007 ]

Surveillance

http://users.isr.ist.utl.pt/~etienne/mypubs/Auvinetal06PETS.pdf

Interfaces

2011

1995

W. T. Freeman and C. Weissman, Television control by hand gestures, International Workshop on Automatic Face- and Gesture-Recognition, IEEE Computer Society, Zurich, Switzerland, June 1995, pp. 179-183. MERL-TR94-24

How can we identify actions?


Motion

Held
Objects

Pose

Nearby
Objects

Today's lecture
Introduction and applications
Features
Template classification methods
Recognition using pose estimation and
objects
Part-based action localisation

Representing Motion
Optical Flow with Motion History

Bobick & Davis 2001
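A minimal sketch of the motion history idea: pixels where motion was just detected are set to a maximum value, and all other pixels decay by one per frame, so recent motion is brighter than older motion. Simple frame differencing stands in for the motion detector here, which is an assumption; the names `update_mhi`, `tau` and `diff_threshold` are illustrative only.

```python
def update_mhi(mhi, prev_frame, frame, tau=5, diff_threshold=30):
    """Return the motion history image after observing `frame`.

    Pixels whose absolute frame difference exceeds the threshold are set
    to tau; every other pixel decays by one, floored at zero.
    """
    new_mhi = []
    for mhi_row, prev_row, cur_row in zip(mhi, prev_frame, frame):
        new_row = []
        for h, p, c in zip(mhi_row, prev_row, cur_row):
            moving = abs(c - p) > diff_threshold
            new_row.append(tau if moving else max(0, h - 1))
        new_mhi.append(new_row)
    return new_mhi


# A bright pixel moving left to right over a 1x4 image.
frames = [
    [[255, 0, 0, 0]],
    [[0, 255, 0, 0]],
    [[0, 0, 255, 0]],
    [[0, 0, 0, 255]],
]
mhi = [[0, 0, 0, 0]]
for prev, cur in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev, cur, tau=5)
print(mhi)  # [[3, 4, 5, 5]]
```

The gradient of values from left to right encodes the direction of motion in a single static template.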

Representing Motion
Optical Flow with Split Channels

Efros et al. 2003
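As a rough illustration of the split-channel idea, the sketch below half-wave rectifies the two optical flow components into four non-negative channels, so that positive and negative motion do not cancel when the channels are later blurred. The flow fields are assumed to be given as nested lists; the blurring and normalisation steps are omitted, and the function name is illustrative.

```python
def split_flow_channels(fx, fy):
    """Split flow (fx, fy) into four non-negative channels: Fx+, Fx-, Fy+, Fy-."""

    def positive_part(channel):
        return [[max(v, 0.0) for v in row] for row in channel]

    def negative_part(channel):
        return [[max(-v, 0.0) for v in row] for row in channel]

    return (positive_part(fx), negative_part(fx),
            positive_part(fy), negative_part(fy))


fx = [[1.5, -2.0]]
fy = [[0.0, 3.0]]
fx_pos, fx_neg, fy_pos, fy_neg = split_flow_channels(fx, fy)
```

By construction, each original component can be recovered as the difference of its two channels (fx = Fx+ - Fx-).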

Representing Motion
Tracked Points

Matikainen et al. 2009

Representing Motion
Space-Time Interest Points

Corner detectors in
space-time

Laptev 2005

Representing Motion
Space-Time Volumes

Blank et al. 2005

Examples of Action Recognition Systems
Feature-based classification
Recognition using pose and objects
Part-based recognition and localisation

Today's lecture
Introduction and applications
Features
Feature classification methods
Recognition using pose estimation and
objects
Part-based action localisation

Action recognition as classification

Retrieving actions in movies, Laptev and Perez, 2007

Remember image categorization

Training: Training Images -> Image Features; Image Features + Training Labels -> Classifier Training -> Trained Classifier
Testing: Test Image -> Image Features -> Trained Classifier -> Prediction (e.g., Outdoor)

Spatial pyramids.

Compute histogram in each spatial bin
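A minimal sketch of the spatial pyramid computation, assuming each pixel (or patch) has already been assigned a visual word index. At level l the image is split into 2^l x 2^l cells and one word histogram is computed per cell; all histograms are concatenated. The per-level weighting used in practice is omitted, and the function and parameter names are illustrative.

```python
def spatial_pyramid(word_map, vocab_size, levels=2):
    """Concatenate per-cell visual word histograms over pyramid levels."""
    height, width = len(word_map), len(word_map[0])
    descriptor = []
    for level in range(levels):
        cells = 2 ** level  # cells per side at this level
        hists = [[0] * vocab_size for _ in range(cells * cells)]
        for y, row in enumerate(word_map):
            for x, word in enumerate(row):
                cy = y * cells // height
                cx = x * cells // width
                hists[cy * cells + cx][word] += 1
        for hist in hists:
            descriptor.extend(hist)
    return descriptor


word_map = [[0, 1],
            [1, 1]]
descriptor = spatial_pyramid(word_map, vocab_size=2, levels=2)
print(descriptor)  # [1, 3, 1, 0, 0, 1, 0, 1, 0, 1]
```

The first two entries are the global histogram; the remaining eight are the four 1x1 cells of the second level.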

Features for Classifying Actions


1. Spatio-temporal pyramids (14x14x8 bins)
Image Gradients
Optical Flow

Features for Classifying Actions


2. Spatio-temporal interest points

Corner detectors in
space-time

Descriptors based on Gaussian derivative filters over x, y, time

Classification
Boosted decision stumps for pyramids of optical flow and gradient
Nearest neighbor for STIP
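For reference, a decision stump is a one-feature threshold classifier; boosting combines many of them. The sketch below trains a single stump on weighted samples, which is the step a boosting loop would repeat while reweighting the data. The boosting loop itself is not shown, labels are assumed to be in {-1, +1}, and all names are illustrative.

```python
def train_stump(samples, labels, weights):
    """Return (weighted_error, feature, threshold, polarity) of the best stump.

    The stump predicts `polarity` when sample[feature] >= threshold,
    and `-polarity` otherwise.
    """
    best = None
    for feature in range(len(samples[0])):
        for threshold in sorted({s[feature] for s in samples}):
            for polarity in (1, -1):
                error = 0.0
                for s, y, w in zip(samples, labels, weights):
                    prediction = polarity if s[feature] >= threshold else -polarity
                    if prediction != y:
                        error += w
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best


samples = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
labels = [-1, -1, 1, 1]
weights = [0.25, 0.25, 0.25, 0.25]
stump = train_stump(samples, labels, weights)
print(stump)  # (0.0, 0, 0.8, 1)
```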

Searching the video for an action


1. Detect keyframes using a trained HOG
detector in each frame
2. Classify detected keyframes as positive (e.g.,
drinking) or negative (other)

Accuracy in searching video


With keyframe
detection

Without keyframe
detection

Talk on phone

Get out of car

Learning realistic human actions from movies, Laptev et al. 2008

Approach
Space-time interest point detectors
Descriptors
HOG, HOF

Pyramid histograms (3x3x2)


SVMs with Chi-Squared Kernel
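A small sketch of a chi-squared kernel between two histograms, of the kind used with SVMs on bag-of-features representations. The exact normalisation used in the cited work is not given on the slide, so the `scale` parameter here is an assumption (a common choice is the mean chi-squared distance over the training set).

```python
import math


def chi2_distance(h1, h2):
    """Chi-squared distance: 0.5 * sum((a - b)^2 / (a + b)) over non-empty bins."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def chi2_kernel(h1, h2, scale=1.0):
    """Kernel value exp(-distance / scale); equals 1 for identical histograms."""
    return math.exp(-chi2_distance(h1, h2) / scale)


print(chi2_distance([1, 0], [0, 1]))  # 1.0
print(chi2_kernel([0.5, 0.5], [0.5, 0.5]))  # 1.0
```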

Spatio-Temporal Binning
Interest Points
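The binning step can be sketched as follows, assuming each interest point comes with a position (x, y, t) and a visual word index. The video volume is divided into a 3x3x2 grid and one word histogram is accumulated per cell. The input format and the function name are assumptions made for illustration.

```python
def spatiotemporal_grid_histogram(points, extent, vocab_size, grid=(3, 3, 2)):
    """Histogram visual words per space-time cell.

    points: iterable of (x, y, t, word); extent: (width, height, n_frames).
    Returns one flat list: cells in (t, y, x) order, vocab_size bins each.
    """
    nx, ny, nt = grid
    width, height, n_frames = extent
    hist = [0] * (nx * ny * nt * vocab_size)
    for x, y, t, word in points:
        bx = min(int(x * nx / width), nx - 1)
        by = min(int(y * ny / height), ny - 1)
        bt = min(int(t * nt / n_frames), nt - 1)
        cell = (bt * ny + by) * nx + bx
        hist[cell * vocab_size + word] += 1
    return hist


points = [(10, 10, 0, 1), (80, 80, 19, 0)]
hist = spatiotemporal_grid_histogram(points, extent=(90, 90, 20), vocab_size=2)
```

In this toy input the first point falls in the first cell and the second in the last cell, so only two entries of the 36-bin descriptor are non-zero.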

Results

Today's lecture
Introduction and applications
Features
Template classification methods
Recognition using pose estimation and
objects
Part-based action localisation

Action Recognition using Pose and Objects

Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities, B. Yao and Li Fei-Fei, 2010
Slide Credit: Yao/Fei-Fei

Human-Object Interaction
Holistic image based classification

Integrated reasoning
Human pose estimation (head, torso)
Object detection (tennis racket)
Action categorization

HOI activity: Tennis Forehand

Human pose estimation & Object detection

Human pose estimation is challenging.

Difficult part appearance
Self-occlusion
Image region looks like a body part

Felzenszwalb & Huttenlocher, 2005
Ren et al, 2005
Ramanan, 2006
Ferrari et al, 2008
Wang & Mori, 2008
Andriluka et al, 2009
Eichner & Ferrari, 2009

Human pose estimation & Object detection

Human pose estimation is facilitated, given that the object is detected.

Human pose estimation & Object detection

Object detection is challenging.

Small, low-resolution, partially occluded
Image region similar to detection target

Viola & Jones, 2001
Lampert et al, 2008
Divvala et al, 2009
Vedaldi et al, 2009

Human pose estimation & Object detection

Object detection is facilitated, given that the pose is estimated.

Human pose estimation & Object detection


Mutual Context


Mutual Context Model Representation

A: Activity (e.g., tennis forehand, croquet shot, volleyball smash)
O: Object (e.g., tennis racket, croquet mallet, volleyball)
H: Human pose. Intra-class variations: more than one H for each A; unobserved during training.
P: Body parts P1, P2, ..., PN. lP: location; θP: orientation; sP: scale.
f: Image evidence fO, f1, f2, ..., fN. Shape context. [Belongie et al, 2002]

Activity Classification Results

Classification accuracy:
Our model: 83.3%
Gupta et al, 2009: 78.9%
Bag-of-words (SIFT+SVM): 52.5%

[Figure: example results for cricket shot and tennis forehand]

Today's lecture
Introduction and applications
Features
Template classification methods
Recognition using pose estimation and
objects
Part-based action localisation

Part-based Recognition & Localisation
Implicit shape model
Goal:
Recognize categories of actions
Localize them in terms of their bounding box (space + time)
Challenges:
Occlusions, clutter, variations
Hypothesis:
Analysis can be restricted to a set of spatiotemporally interesting/salient events

Information theoretical spatial saliency
Proposal: Use signal unpredictability as an indicator of saliency
Spatial Saliency: Unpredictability in a single frame
[Figure: two example image regions with entropy H_D = 3.866 and H_D = 7.201]
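To make the unpredictability criterion concrete, the sketch below computes the Shannon entropy of the intensity histogram of an image region: a flat region has zero entropy, while a region with many equally likely intensities has high entropy. The choice of raw intensities as the descriptor and the function name are assumptions made for illustration.

```python
import math
from collections import Counter


def patch_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


flat_patch = [7] * 16              # one intensity only: fully predictable
textured_patch = [0, 1, 2, 3] * 4  # four equally likely intensities

print(patch_entropy(flat_patch) == 0)   # True
print(patch_entropy(textured_patch))    # 2.0
```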

Towards scale invariance

The entropy maxima reveal the spatial scale(s) of a salient region
[Plot: entropy versus scale (circle radius)]
[Figure: detected salient points in a single frame]
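Scale selection by entropy maxima can be sketched as follows: compute the entropy of the intensities inside circles of increasing radius around a point, then keep the radii at which the entropy curve has a local maximum. Raw intensities as the descriptor, strict local maxima as the peak criterion, and the function names are all assumptions of this sketch.

```python
import math
from collections import Counter


def entropy_at_scale(image, cx, cy, radius):
    """Entropy (bits) of the intensities inside a circle centred at (cx, cy)."""
    values = [v for y, row in enumerate(image) for x, v in enumerate(row)
              if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2]
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def peak_scales(scales, entropies):
    """Scales at which the entropy curve has a strict local maximum."""
    return [scales[i] for i in range(1, len(scales) - 1)
            if entropies[i - 1] < entropies[i] > entropies[i + 1]]


# A hand-made entropy curve with two peaks.
scales = [1, 2, 3, 4, 5]
curve = [0.5, 1.2, 0.9, 1.5, 1.4]
print(peak_scales(scales, curve))  # [2, 4]
```

In use, the curve would be built as `[entropy_at_scale(image, cx, cy, r) for r in scales]` for each candidate point.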

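Spatial and spatiotemporal saliency

Spatiotemporal Saliency:
Driven by signal unpredictability (entropy $H_D$) in a spatiotemporal volume (cylinder / sphere)

Examine the entropy:

$$Y(v_k) = w(v_k)\, H(v_k)$$

where $H$ measures the entropy's height and $w$ the entropy's peakedness:

$$w(s, d, u) = s \int_q \left| \frac{\partial}{\partial s} p_D(s, d, u) \right| dq + d \int_q \left| \frac{\partial}{\partial d} p_D(s, d, u) \right| dq$$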

Descriptor extraction and codebook creation

[Figure: processing stages over time t: input sequence; optical flow; optical flow after median subtraction; spatiotemporal salient point detection; class-specific codebook (c1, c2, ..., cN)]

Optical Flow + Spatial Gradient Descriptors: bin in histograms and concatenate
Feature ensembles: O. Boiman & M. Irani [ICCV05]
Feature selection: ensemble codewords

Class-dependent spatio-temporal probabilistic voting
Parameters stored for each ensemble $e_d$ in the training set:
X: average spatial position of the ensemble with respect to the subject center and lower bound.
T: distance in frames of the activated ensemble from the start/end of the action.
S: average spatiotemporal scale of the ensemble.

Localisation model learned for codeword/cluster $c_i$, with $\theta = (X, T, S)$:

$$p(\theta \mid c_i) = w_i \sum_d p(e_d \mid c_i)\, p(\theta \mid e_d)$$

[Figure: an ensemble $e_d$ matched to codeword $c_i$ at the current frame $t$ votes for the start and end of the action, at offsets $-t$ and $T - t$]
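The temporal part of the voting can be illustrated with a toy accumulator: each activated codeword remembers how many frames after the action start it was seen in training, so at test time it casts a weighted vote for the start frame. This sketch uses exact-frame accumulation and a simple argmax, which is a simplification; the input format, the weights and the function name are hypothetical.

```python
from collections import defaultdict


def vote_for_action_start(activations):
    """Accumulate weighted votes for the start frame of an action.

    activations: iterable of (frame, offset_from_start, weight), where
    offset_from_start is the temporal offset stored at training time.
    Returns the best-supported start frame and the full vote table.
    """
    votes = defaultdict(float)
    for frame, offset, weight in activations:
        votes[frame - offset] += weight
    best_start = max(votes, key=votes.get)
    return best_start, dict(votes)


activations = [
    (12, 2, 0.5),    # seen at frame 12, expected 2 frames after the start
    (15, 5, 0.25),
    (20, 10, 0.25),
    (30, 3, 0.25),   # an outlier vote, e.g. from background clutter
]
start, votes = vote_for_action_start(activations)
print(start, votes)  # 10 {10: 1.0, 27: 0.25}
```

Because each activation votes independently, a few missing or spurious activations shift the vote totals but do not remove the main peak.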

Discriminative learning
Higher weights for pdfs with higher localisation accuracy:

$$w_i = \exp\left( \int p(\theta \mid c_i) \log p(\theta \mid c_i)\, d\theta \right)$$

Class dictionary comprises discriminative codewords
Adaboost on the codeword similarities

Spatio-temporal probabilistic voting

Hypothesis verification with RVM-based classification
The Relevance Vector Machine (RVM) is a variant of the Support Vector Machine
Mean-shift responses $F = \{f_1, \ldots, f_i, \ldots\}$ are used as features in RVM-based classification, with distance $D_C(F, F')$ and kernel $K(F, F')$ between two sets of responses

Two-class classification problem for class $l$:

$$c_l(F; w) = w_0 + \sum_{j=1}^{N} w_j K_l(F, F_j)$$

Select the class $l$ that maximizes the posterior probability

$$p(l \mid F) = \frac{1}{1 + e^{-c_l(F; w)}}$$
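The decision rule can be sketched in a few lines: for each class, a weighted sum of kernel values against the training examples gives a score, a logistic function turns the score into a posterior, and the class with the highest posterior is selected. Training of the weights (the sparse Bayesian part of the RVM) is not shown; the kernel values, weights and class names below are made-up illustration data.

```python
import math


def predict(kernel_rows, models):
    """Pick the class with the highest posterior.

    kernel_rows[label]: kernel values K_l(F, F_j) against each training example j.
    models[label]: (w0, [w_1, ..., w_N]) for the two-class problem of that label.
    """
    posteriors = {}
    for label, (bias, weights) in models.items():
        score = bias + sum(w * k for w, k in zip(weights, kernel_rows[label]))
        posteriors[label] = 1.0 / (1.0 + math.exp(-score))
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


models = {"walk": (0.0, [1.0, -1.0]), "run": (-1.0, [0.5, 0.5])}
kernel_rows = {"walk": [0.9, 0.1], "run": [0.2, 0.3]}
best, posteriors = predict(kernel_rows, models)
print(best)  # walk
```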

Localisation of single actions

Localisation accuracy (KTH)

Action recognition

KTH dataset average: 88%

HoHA dataset average: 37%


Localisation under artificial occlusions (KTH)

Localisation under clutter (KTH)

Summary
Advantages

Highly flexible structure model

Each part casts votes independently

Only few training examples are needed

Fast recognition

Robustness to occlusions

Disadvantages

Loose spatial model that does not model co-occurrence of parts.

False positives in background (clutter)

Take-home messages
Action recognition is an open problem.

How to define actions?


How to infer them?
What are good visual cues?
How do we incorporate higher level reasoning?

Take-home messages
Some work has been done, but it is just the beginning of exploring the problem. So far:
Actions are mainly categorical
Most approaches are classification using simple features (spatial-temporal histograms of gradients or flow, s-t interest points, SIFT in images)
Only a couple of works address how to incorporate pose and objects
Not much idea of how to reason about long-term activities or to describe video sequences

To come
Recognition in image sequences
Action recognition
(body gestures / facial expressions)
Tracking
Structure from motion
Surveillance

References

C. Sminchisescu. Learning and Inference Algorithms for Monocular Perception - Applications to Visual Object Detection, Localization and Time Series Models for 3D Human Motion Understanding. Habilitation Thesis, University of Bonn, Faculty of Mathematics and Natural Sciences, 2007.
A. Bobick and J. Davis, The recognition of human movement using temporal templates, IEEE Trans. PAMI, vol. 23, pp. 257-267, Mar 2001.
Alexei A. Efros, Alexander C. Berg, Greg Mori, and Jitendra Malik. Recognizing action at a distance. In Proceedings of IEEE ICCV '03, Volume 2, 2003.
P. Matikainen, M. Hebert, and R. Sukthankar. Trajectons: Action recognition through the motion analysis of tracked features. In Workshop on Video-Oriented Object and Event Classification, ICCV 2009.
Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3), 2005.
M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, in: ICCV, Beijing, China, Oct 15-21, 2005.

Ivan Laptev, Patrick Pérez, Retrieving actions in movies, in: Proceedings of the ICCV '07, Rio de Janeiro, Brazil, October 2007, pp. 1-8.
Ivan Laptev, Marcin Marszałek, Cordelia Schmid, Benjamin Rozenfeld, Learning realistic human actions from movies, in: Proceedings of the CVPR '08, Anchorage, AK, June 2008, pp. 1-8.
Bangpeng Yao and Li Fei-Fei, Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities, IEEE CVPR '10, San Francisco, CA, USA, June 13-18, 2010.
O. Boiman, M. Irani, Detecting irregularities in images and in video, in: ICCV, Beijing, China, Oct 15-21, 2005.
P. Felzenszwalb, D. Huttenlocher, Pictorial Structures for Object Recognition, IJCV, Vol. 61, No. 1, January 2005.
Deva Ramanan, Learning to parse images of articulated bodies, in: NIPS 19, Vancouver, Canada, December 2006.
V. Ferrari, M. Marin, and A. Zisserman, "Progressive Search Space Reduction for Human Pose Estimation", IEEE CVPR, Alaska, June 2008.

Yang Wang and Greg Mori, Multiple Tree Models for Occlusion and Spatial Constraints in Human Pose Estimation, ECCV, 2008.
M. Andriluka, S. Roth, B. Schiele, Pictorial Structures Revisited: People Detection and Articulated Pose Estimation, in: IEEE CVPR, 2009.
M. Eichner and V. Ferrari, "Better Appearance Models for Pictorial Structures", BMVC, London, September 2009.
Paul A. Viola and Michael J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR (1), 2001.
Matthew B. Blaschko, Christoph H. Lampert, "Learning to Localize Objects with Structured Output Regression", ECCV, Marseilles, France, 2008.
A. Oikonomopoulos, I. Patras and M. Pantic, "Spatiotemporal Localization and Categorization of Human Actions in Unsegmented Image Sequences", IEEE Trans. Image Processing, vol. 20, no. 4, pp. 1126-1140, Mar. 2011.
A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, Multiple Kernels for Object Detection, in: Proceedings of the ICCV, 2009.
