
Discriminative Virtual Views for Cross-View Action Recognition


Ruonan Li and Todd Zickler
Harvard School of Engineering and Applied Sciences

Cross-View Recognition
kick

View 1

Cross-View Recognition
kick?

View 1

View 2


Cross-View Recognition
x

Cross-View Recognition
W
x

Our approach: 1. Dimension expansion for discriminative cross-view features;

Our Approach: Cross-View Dimension Expansion


W
x

x̃ = Wx


Our approach: 1. Dimension expansion for discriminative cross-view features; 2. Exploit various cross-view annotation types that were considered separately before.
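In code, the dimension expansion amounts to projecting the original action descriptor onto every virtual view and concatenating the results into one higher-dimensional cross-view feature. A minimal NumPy sketch; the function name and toy sizes are ours, not the authors' code:

```python
import numpy as np

def cross_view_feature(x, virtual_views):
    """Concatenate the projections of descriptor x onto each virtual view.

    x: (D,) action descriptor.
    virtual_views: list of (D, d) matrices A_1 ... A_M with orthonormal
    columns, interpolating from the source view to the target view.
    Returns the expanded feature x_tilde = Wx of length M * d.
    """
    return np.concatenate([A.T @ x for A in virtual_views])

# Toy example: D = 4 input dimensions, d = 2 per view, M = 3 virtual views.
rng = np.random.default_rng(0)
views = [np.linalg.qr(rng.standard_normal((4, 2)))[0] for _ in range(3)]
x = rng.standard_normal(4)
x_tilde = cross_view_feature(x, views)  # shape (6,): expanded from 4 to 3 * 2
```

The expansion trades a single view-dependent descriptor for a stack of views along the source-to-target path, which is what makes the feature usable across viewpoints.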

Case 1: Weakly Labeled Target View


View 1 (Source View) View 2 (Target View)

Case 1: Weakly Labeled Target View


Domain Adaptation/Cross-Domain Recognition
Domain 1 (Source Domain) Domain 2 (Target Domain)

Ben-David 2007; Blitzer 2006, 2007; Daume III 2007; Duan 2009, 2010, 2011; Jhuo 2012; Kulis 2011; Pan 2010, 2011; Saenko 2010; and many others.

Case 2: Unlabeled Correspondence


View 1 (Source View) View 2 (Target View)

The same unknown action executed by the same subject and simultaneously observed in both views

Farhadi 2008, 2009; Liu 2011

Case 3: No Supervision in the Target View


View 1 (Source View) View 2 (Target View)

Gong 2012; Gopalan 2011;

Intra-Class Similarity
View 1 (Source View) View 2 (Target View)

kick

kick

x̃ = Wx

Inter-Class Separability
View 1 (Source View) View 2 (Target View)

kick

wave hand

x̃ = Wx

In either case, we would like to maximize our ability to discriminate between the two classes in all available labeled samples. To this end, we seek transformations A_S and A_T that maximize the mutual information between the cross-view feature x̃ and the class label c ∈ {−1, 1}:

    max_{A_S, A_T} I(x̃; c).    (3)

Since x̃ = Wx, this can equivalently be written

    max_W I(Wx; c).    (4)

Note that

    I(x̃; c) = H(x̃) − H(x̃ | c)
             = H(x̃) − P(c = 1) H(x̃ | c = 1) − P(c = −1) H(x̃ | c = −1),    (5)

so (3) can be written in terms of the differential entropy H(x̃). To solve (3), we approximate H(x̃) using a finite set of samples, assuming that the samples of the cross-view feature x̃ are drawn from a Gaussian distribution.
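Under the Gaussian assumption, each differential entropy reduces to ½ ln det of a covariance matrix (up to an additive constant that cancels in the difference), so the mutual-information objective can be estimated directly from samples. A sketch with illustrative names (`gaussian_mi` is ours):

```python
import numpy as np

def gaussian_mi(X, c):
    """Estimate I(x; c) = H(x) - sum_k P(c=k) H(x | c=k), approximating
    each differential entropy by 0.5 * ln det(covariance) + const.
    The constants cancel because the class weights sum to one."""
    def half_logdet_cov(Z):
        S = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])  # regularized
        return 0.5 * np.linalg.slogdet(S)[1]
    return half_logdet_cov(X) - sum(
        (c == k).mean() * half_logdet_cov(X[c == k]) for k in np.unique(c))

# Two well-separated classes: features carry information about the label.
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((200, 2)),
               rng.standard_normal((200, 2)) + 5.0])
c = np.array([-1] * 200 + [1] * 200)
mi = gaussian_mi(X, c)  # large and positive for separated classes
```

Shuffling the labels collapses the estimate toward zero, since the per-class covariances then match the overall covariance.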

Unlabeled Correspondence

View 1 (Source View) View 2 (Target View)

The same unknown action executed by the same subject and simultaneously observed in both views

x̃ = Wx



Cross-View Transform Objective

    max_W  ln det Σ_all − (1/2) ln det Σ_Class1 − (1/2) ln det Σ_Class2 − λ ln det Σ̂    (8)

Assumptions:
Uniform distribution for the class prior;
Gaussian distribution for cross-view features.

Here Σ̂ is the correlation matrix, not the covariance matrix, of the x̃'s. A minimization of det Σ̂ will yield x̃'s concentrating around 0, by which we enforce the correspondence.

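An objective of this ln-det form can be evaluated numerically from samples. Note that the correspondence term uses correlation (second-moment) matrices about the origin rather than mean-centred covariances; the helper names below and the exact form of the correspondence input are our paraphrase of the slide, not the authors' code:

```python
import numpy as np

def logdet_corr(Z, eps=1e-6):
    """ln det of the correlation (second-moment) matrix Z^T Z / n.
    Unlike a covariance it is NOT mean-centred, so shrinking its
    determinant drives the samples themselves toward 0."""
    S = Z.T @ Z / len(Z) + eps * np.eye(Z.shape[1])
    return np.linalg.slogdet(S)[1]

def objective_8(X, c, X_corr=None, lam=1.0):
    """ln det S_all - 0.5 ln det S_class1 - 0.5 ln det S_class2
    - lam * ln det S_corr, where X_corr holds difference vectors of
    corresponding source/target features (when correspondences exist)."""
    J = (logdet_corr(X)
         - 0.5 * logdet_corr(X[c == 1])
         - 0.5 * logdet_corr(X[c == -1]))
    if X_corr is not None:
        J -= lam * logdet_corr(X_corr)
    return J
```

The first three terms reward inter-class separability relative to intra-class spread; the last rewards corresponding cross-view features that agree.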

Cross-View Transform

Source View

Target View

AS

Cross-View Transform

Source View

Target View

AS

AT


Cross-View Transform

Source View

Target View

AS

AT
W
x

Cross-View Transform

Source View

Target View

AS

AS

Cross-View Transform

Source View

Target View

AS

AT
x̃ = [A_S ⋯ A_T]ᵀ x

Cross-View Transform

Source View

Target View

AS

AT

x̃ = [A_1 A_2 A_3 ⋯]ᵀ x    (1)
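The virtual path needs a way to interpolate between two orthonormal transforms. One simple realization, sketched below, blends the endpoints linearly and projects each blend back onto the set of matrices with orthonormal columns via the SVD; this is our illustration of the idea, and the paper's closed-form interpolation may differ in detail:

```python
import numpy as np

def virtual_path(A_S, A_T, M):
    """Return M transforms from A_S to A_T: linear blend of the endpoints,
    then polar projection (via SVD) back to orthonormal columns."""
    path = []
    for t in np.linspace(0.0, 1.0, M):
        B = (1.0 - t) * A_S + t * A_T
        U, _, Vt = np.linalg.svd(B, full_matrices=False)
        path.append(U @ Vt)
    return path

# Stack the path into W, so that x_tilde = W @ x as in (1).
rng = np.random.default_rng(1)
A_S = np.linalg.qr(rng.standard_normal((4, 2)))[0]
A_T = np.linalg.qr(rng.standard_normal((4, 2)))[0]
path = virtual_path(A_S, A_T, 5)
W = np.vstack([A.T for A in path])  # (5 * 2, 4)
```

The endpoints of the path reproduce A_S and A_T exactly, and every intermediate transform keeps orthonormal columns.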


Discriminative Virtual Views

Source View

Target View

All virtual views and real views

x̃ = [A_1 A_2 A_3 ⋯]ᵀ x

Discriminative Cross-View Dimension Expansion

W

x̃ = [A_1 A_2 A_3 ⋯]ᵀ x

Solution

We obtain the optimal rotation R with Algorithm 1 (Greedy Axis Rotation):

    max_{R ∈ SO(D)} J(R A_S(t−1), A_T(t−1)),

subject to A_Sᵀ A_S = I and A_Tᵀ A_T = I. The transformations are initialized with LDA (or PCA whenever LDA is unavailable), the virtual transforms in between are interpolated in closed form, and the search follows an approximate gradient on SO(D): the steepest-ascent direction lies in so(D), the Lie algebra of SO(D) (details in the Appendix).

Non-Discriminative Virtual Views

How will the performance be affected if these transformations are not learned discriminatively? Doing so reduces our approach to non-discriminative projections, similar to the method of Gopalan et al. [11]; Table 3 compares the two.
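The greedy axis rotation can be sketched as coordinate ascent over Givens rotations: for each pair of axes, try a small in-plane rotation in both directions and keep it whenever the objective J improves. This is our simplified sketch of the idea behind Algorithm 1, with an illustrative objective, not the authors' implementation:

```python
import numpy as np

def givens(D, i, j, s):
    """D x D rotation by angle s in the (i, j) coordinate plane."""
    R = np.eye(D)
    R[i, i] = R[j, j] = np.cos(s)
    R[i, j], R[j, i] = -np.sin(s), np.sin(s)
    return R

def greedy_axis_rotation(A, J, step=0.05, sweeps=20):
    """Coordinate ascent on SO(D): for every axis pair, try a small
    rotation in both directions and keep it whenever J(R @ A) > J(A)."""
    D = A.shape[0]
    for _ in range(sweeps):
        for i in range(D):
            for j in range(i + 1, D):
                for s in (step, -step):
                    cand = givens(D, i, j, s) @ A
                    if J(cand) > J(A):
                        A = cand
    return A

# Illustrative objective: align the single column of A with direction v.
v = np.array([0.0, 1.0])
J = lambda A: float(v @ A[:, 0]) ** 2
A0 = np.array([[1.0], [0.0]])
A1 = greedy_axis_rotation(A0, J)  # J increases while A stays orthonormal
```

Because each update is a rotation, the orthonormality constraints A_Sᵀ A_S = I and A_Tᵀ A_T = I are preserved by construction.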

[Results table (method labels partly illegible): accuracy (%) on views c0-c4. MIXSVM: 38.5 43.4 50.3 51.0 35.1; two variants: 54.2 50.8 58.1 49.5 46.9 and 46.4 44.2 52.3 47.7 44.7; Our Approach: 62.0 65.5 64.5 69.5 57.9.]

5. Conclusion
We propose an approach for cross-view action recognition in which the source and target views are connected by a smooth virtual path represented as a sequence of linear transformations of action descriptors.

Solution

The expanded cross-view feature is x̃ = Wx, where W stacks the sequence of virtual views A_1, A_2, A_3, ..., A_T along a smooth path connecting the source and target, with each A_t satisfying A_t^T A_t = I. These observations imply that one may use a relatively smaller number of virtual views and dimensions per view unless very high accuracy is desired, especially for the correspondence mode.
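The "dimension expansion" itself is a simple stacking of projections. A short sketch (the virtual views here are random orthonormal matrices for illustration only):

```python
import numpy as np

def expand(x, views):
    # Cross-view "dimension expansion": stack the projection of one action
    # descriptor x onto every virtual view A_1, ..., A_T into a single long
    # feature x_tilde = Wx.
    return np.concatenate([A.T @ x for A in views])

rng = np.random.default_rng(1)
views = []
for _ in range(4):                       # 4 hypothetical virtual views
    Q, _ = np.linalg.qr(rng.normal(size=(10, 3)))
    views.append(Q)                      # each A_t is 10 x 3, A_t^T A_t = I
x = rng.normal(size=10)
x_tilde = expand(x, views)
print(x_tilde.shape)                     # (12,): T views times d dims each
```

A single classifier trained on x_tilde can then weight each virtual view's projection separately, which is what makes the expanded feature discriminative across views.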
The transformations A_1, ..., A_T are selected discriminatively by maximizing a measure of mutual information, in a training set, between the virtual views and the class labels (Algorithm 1, Greedy Axis Rotation; input: A_S(t-1), A_T(t-1), λ > 0, η > 0).

This view-transfer mechanism operates under a variety of weakly supervised scenarios (matched source-target pairs, partial source-target labels, or no target labels at all), annotation types that have previously been considered quite separately. In all cases, our performance compares to or improves upon the state of the art.

Appendix: Approximate Gradient Ascent on SO(D)
Consider the generic optimization max_{R in SO(D)} J(R). The steepest ascent direction is computed in the Lie algebra so(D) using the generators E_{i,j} (the matrix whose (i, j)th element is 1 while the others are zero), e.g. J_{S,i,j} = J(R_S exp(η(E_{i,j} - E_{j,i}))), similar to the method of Gopalan et al. [11].
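The greedy axis-rotation idea can be illustrated with a toy objective. This is a crude coordinate-plane hill-climb, not the paper's approximate gradient ascent; the objective J and all names below are illustrative assumptions:

```python
import numpy as np

def plane_rotation(D, i, j, theta):
    # exp(theta * (E_ij - E_ji)): rotation by theta in the (i, j) coordinate
    # plane, the basic move of an axis-rotation ascent on SO(D).
    R = np.eye(D)
    R[i, i] = R[j, j] = np.cos(theta)
    R[i, j] = -np.sin(theta)
    R[j, i] = np.sin(theta)
    return R

def greedy_ascent(J, D, steps=20, eta=0.05):
    # Hill-climb J over SO(D): try a small rotation in each coordinate plane
    # (both directions) and keep any move that increases the objective.
    R = np.eye(D)
    for _ in range(steps):
        for i in range(D):
            for j in range(i + 1, D):
                for s in (eta, -eta):
                    cand = plane_rotation(D, i, j, s) @ R
                    if J(cand) > J(R):
                        R = cand
    return R

# Toy objective: align R with a known rotation R_star; maximum value is 3.
R_star = plane_rotation(3, 0, 1, 0.7)
J = lambda R: np.trace(R_star.T @ R)
R = greedy_ascent(J, 3)
print(round(J(R), 2))   # 3.0
```

Because every accepted move stays on SO(D) exactly (products of rotations), no projection step is needed, which is the appeal of working through so(D) generators.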

The connection between the source and the target has been thoroughly exploited via the smooth path, so additional source views contribute only limited additional discrimination.

Solution

Recap & Opportunities

Discriminative cross-view "dimension expansion":
max_{A_S, A_T} I(x̃; c)    (3)
max_W I(Wx; c)    (4)
which uniformly accounts for various cross-view semi-supervisions.

Writing the mutual information in terms of differential entropies,
I(x̃; c) = H(x̃) - H(x̃|c) = H(x̃) - P(c=1)H(x̃|c=1) - P(c=-1)H(x̃|c=-1),    (5)
we approximate the differential entropy H(x̃) from a finite set of samples.
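One simple way to realize the finite-sample entropy approximation behind Eq. (5) is a Gaussian plug-in estimator; the paper's own estimator may differ, and the multi-class sum below generalizes the binary form of Eq. (5):

```python
import numpy as np

def gauss_entropy(X):
    # Differential entropy of samples X under a Gaussian approximation:
    # H = 0.5 * log det(2*pi*e * Sigma), with a small ridge for stability.
    Sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    _, logdet = np.linalg.slogdet(2 * np.pi * np.e * Sigma)
    return 0.5 * logdet

def mutual_info(X, c):
    # I(x; c) = H(x) - sum_k P(c = k) H(x | c = k), Eq. (5) written for an
    # arbitrary number of classes.
    classes, counts = np.unique(c, return_counts=True)
    H_cond = sum((n / len(c)) * gauss_entropy(X[c == k])
                 for k, n in zip(classes, counts))
    return gauss_entropy(X) - H_cond

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=[0, 0], size=(200, 2)),    # class +1
               rng.normal(loc=[4, 0], size=(200, 2))])   # class -1
c = np.array([1] * 200 + [-1] * 200)
print(mutual_info(X, c) > 0.5)   # well-separated classes give high MI: True
```

Maximizing this quantity over the transformations encourages expanded features whose class-conditional distributions are compact relative to the overall spread.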
mporal features = = H( x ( ) In a H( x A1 A2 A3 ons.H( ) x|c) x|c) poral features underlying acAnother emerging A A of approa (5)family A1), 2 (5) 3 ) = (c = x x(c P1)H( P xP (c Paction1)H( N 1)H( )Another= x Priminating be- ) Pview 1)H( N ), xfamily ofadapting f = (c emerging l features acnderlying recognition by approach AT rlying ac- beiminating view action recognition by adapting on ar viewpoints, Another or recognition models trainedfeat tions, emerging family of approaches n viewpoints, view the the be bewritten in terms oftions,differential entropy adapting feature nating written in terms ofor to differential entropy theon on action recognitionmodels trained reco r different when views recognition view where a target by ewpoints, fea- tions,performed [8, 7,view This boilson one to or to a target 14]. where the recogni ifferent when y of Discrimina)ve viewsrecognition models trained down o these cross-view `dimension expansion; x ve when , of (3), we approximate differential14]. where the) down to dr we approximate differential [8, viewH( ) boils x ent these fea- viewsof statistical7, entropy H( recognitio to a target connections performed entropy This between view s more Uniformly account Assuming that the samples signifor that the nite set of samples. statistical cross-view semi-supervisions. et of feasamples. performedv[8, 7, connections between to draw Assumingarious 14].samples down view-de these signimore ofextracted from This boilsviewing direct different

IXMAS Dataset
11 action categories performed by 10 actors, taken from 5 views.
Check-watch Scratch-head
View0

Sit-down

Wave-hand

Kicking

Pick-up

View1

View2

View3

View4

Case 1 Result: Weakly Labeled Target View

View Mixing: An SVM trained on the union of source and target views

Case 1 Result: Weakly Labeled Target View

Training classifiers for each action in each view does not scale well due to the requirement of excessive labeled training data, so a possible line of attack is to search for view-invariant features, representations, or models that can be used for all viewpoints. One approach is to infer three-dimensional scene structure so that the derived action descriptors can be adapted from one view to another through geometric reasoning [29, 26, 15, 10, 4], while another is to search for spatio-temporal features of a video sequence that are insensitive to changes in view angle [17, 21, 23, 22, 3, 28]. Recent view-invariant approaches include [27] and [13]. The former learns a classifier on examples taken from various views, and the latter introduces a temporal self-similarity matrix and demonstrates its view stability empirically.
View Concatenation: Daume III 2007

x̃_S = [x^T, 0^T, x^T]^T ,   x̃_T = [0^T, x^T, x^T]^T
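This "frustratingly easy" augmentation triples each feature vector into view-specific and shared blocks so one linear classifier can learn both kinds of weights. A minimal sketch following the block ordering shown above (the paper's own ordering convention may differ):

```python
import numpy as np

def augment(x, domain):
    # Daume III (2007) feature augmentation: [source-only; target-only;
    # shared] blocks. A source example fills the source and shared slots;
    # a target example fills the target and shared slots.
    z = np.zeros_like(x)
    if domain == "source":
        return np.concatenate([x, z, x])
    return np.concatenate([z, x, x])

x = np.array([1.0, 2.0])
print(augment(x, "source"))   # [1. 2. 0. 0. 1. 2.]
print(augment(x, "target"))   # [0. 0. 1. 2. 1. 2.]
```

Any off-the-shelf linear classifier trained on the augmented vectors then implicitly fits shared and view-specific weight blocks at once.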



Case 1 Result: Weakly Labeled Target View

A knowledge transfer approach is the work of Farhadi et al. [7], who rely on simultaneous multi-view observations of the same action instance to explicitly identify maps between one view's features and those of another, thereby allowing a classifier learned in one view to be adapted by suitably reorganizing its weights. Another example is the work of Liu et al. [14], who rely on the same style of input to learn a cross-view bag of bilingual words representation in which each bilingual word represents the co-occurrence of one visual word in one view with another visual word in another view.
Combining Classifiers: Schweikert 2008




f(x) = λ f^S(x) + (1 − λ) f^T(x)
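The combined classifier is just a convex blend of per-view scoring functions. A tiny sketch with hypothetical linear scorers for illustration:

```python
import numpy as np

def combine(f_S, f_T, lam):
    # Schweikert et al. (2008)-style combination:
    # f(x) = lam * f_S(x) + (1 - lam) * f_T(x), with 0 <= lam <= 1.
    return lambda x: lam * f_S(x) + (1 - lam) * f_T(x)

# Hypothetical source- and target-view scoring functions.
f_S = lambda x: np.dot([1.0, 0.0], x)
f_T = lambda x: np.dot([0.0, 1.0], x)
f = combine(f_S, f_T, 0.25)
print(f(np.array([4.0, 8.0])))   # 0.25*4 + 0.75*8 = 7.0
```

In practice λ would be chosen on held-out weakly labeled target examples, trading off how much the source classifier is trusted against the target one.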



Case 1 Result: Weakly Labeled Target View

Case 2 Result: Unlabeled Correspondence

Farhadi 2008,2009

Case 2 Result: Unlabeled Correspondence


Liu 2011

[Figure 2 (Liu 2011): the process of discovering bilingual words. Unlabelled action videos A1 and A2, observed in View 1 and View 2, are quantized against visual-word vocabularies V1 and V2 to form BoVW models M1 and M2; a bipartite graph linking the two vocabularies is partitioned to obtain bilingual words, yielding bag-of-bilingual-words (BoBW) models for the training data.]
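The bipartite partitioning step can be sketched with spectral co-clustering of a cross-view co-occurrence matrix. This is a simplified sketch in the spirit of Liu 2011's bag-of-bilingual-words, not the paper's exact algorithm, and the co-occurrence matrix below is a toy:

```python
import numpy as np

def bilingual_words(C):
    # Split a cross-view co-occurrence matrix C (rows: view-1 visual words,
    # cols: view-2 visual words) into two "bilingual words" by spectral
    # bipartite co-clustering: normalize by row/column degrees, then read
    # the sign pattern of the second singular vectors.
    Dr = np.diag(1.0 / np.sqrt(C.sum(axis=1)))
    Dc = np.diag(1.0 / np.sqrt(C.sum(axis=0)))
    U, _, Vt = np.linalg.svd(Dr @ C @ Dc)
    rows = (U[:, 1] > 0).astype(int)     # cluster id per view-1 word
    cols = (Vt[1, :] > 0).astype(int)    # cluster id per view-2 word
    return rows, cols

# Toy co-occurrences: view-1 words {0,1} fire with view-2 words {0,1},
# and words {2,3} with {2,3}, plus weak cross-links.
C = np.array([[5.0, 4.0, 1.0, 0.0],
              [4.0, 5.0, 0.0, 1.0],
              [1.0, 0.0, 5.0, 4.0],
              [0.0, 1.0, 4.0, 5.0]])
rows, cols = bilingual_words(C)
print(rows[0] == rows[1] != rows[2] == rows[3])   # True: blocks recovered
```

Each resulting cluster pairs a group of view-1 words with the group of view-2 words they co-fire with, which is exactly the role a bilingual word plays in the BoBW representation.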

Case 2 Result: Unlabeled Correspondence

Case 3 Result: No Supervision in the Target View

Case 3 Result: No Supervision in the Target View

Gopalan 2011

Case 3 Result: No Supervision in the Target View

Non-discriminative virtual views

Case 3 Result: No Supervision in the Target View

Four Source Views + Single Target View


Case 1

Case 2

Summary
A dimension expansion transform to produce cross-view features which (1) are discriminative; (2) encode the transition from the source to the target -- see also Gong et al., CVPR 2012.
Accommodates various types of semi-supervision, including (1) a weakly labeled target view; (2) unlabeled correspondence between the two views; (3) a completely unsupervised target view. These annotation types were considered separately before.
Applicable to general domain transfer/adaptation problems.

Discriminative Virtual Views for Cross-View Action Recognition

Varying Number of Virtual Views and Varying Dimensions of Virtual Views