Liang Lin · Dongyu Zhang · Wangmeng Zuo

Human Centric Visual Analysis with Deep Learning

Liang Lin
School of Data and Computer Science
Sun Yat-sen University
Guangzhou, Guangdong, China

Dongyu Zhang
School of Data and Computer Science
Sun Yat-sen University
Guangzhou, Guangdong, China

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Foreword
When Liang asked me to write the foreword to his new book, I was very happy and
proud to see the success that he has achieved in recent years. I have known Liang
since 2005, when he visited the Department of Statistics of UCLA as a Ph.D.
student. Very soon, I was deeply impressed by his enthusiasm and potential in
academic research during regular group meetings and his presentations. Since 2010,
Liang has been building his own laboratory at Sun Yat-sen University, which is the
best university in southern China. I visited him and his research team in the summer
of 2010 and spent a wonderful week with them. Over these years, I have witnessed
the fantastic success of him and his group, who have set an extremely high standard. His
work on deep structured learning for visual understanding has built his reputation as
a well-established professor in computer vision and machine learning. Specifically,
Liang and his team have focused on improving feature representation learning with
several interpretable and context-sensitive models and applied them to many
computer vision tasks, which is also the focus of this book. On the other hand, he
has a particular interest in developing new models, algorithms, and systems for
intelligent human-centric analysis while continuing to focus on a series of classical
research tasks such as face identification, pedestrian detection in surveillance, and
human segmentation. The performance of human-centric analysis has been sig-
nificantly improved by recently emerging techniques such as very deep neural
networks, and new advances in learning and optimization. The research team led by
Liang is one of the main contributors in this direction and has received increasing
attention from both academia and industry. In sum, Liang and his colleagues have done
an excellent job with this book, which is the most up-to-date resource you can find
and a great introduction to human-centric visual analysis with emerging deep
structured learning.
If you need more motivation than that, here is the foreword:
In this book, you will find a wide range of research topics in human-centric
visual analysis including both classical (e.g., face detection and alignment) and
newly rising topics (e.g., fashion clothing parsing), and a series of state-of-the-art
solutions addressing these problems. For example, a newly emerging task, human
parsing, namely, decomposing a human image into semantic fashion/body regions,
is deeply and comprehensively introduced in this book, and you will find not only
the solutions to the real challenges of this problem but also new insights from which
more general models or theories for related problems can be derived.
To the best of our knowledge, to date, a published systematic tutorial or book
targeting this subject is still lacking, and this book will fill that gap. I believe this
book will serve the research community in the following aspects:
(1) It provides an overview of current research in human-centric visual analysis and highlights the progress and difficulties.
(2) It includes a tutorial on advanced techniques of deep learning, e.g., several types of neural network architectures, optimization methods, and techniques.
(3) It systematically discusses the main human-centric analysis tasks at different levels, ranging from face/human detection and segmentation to parsing and other higher level understanding.
(4) It provides effective methods and detailed experimental analysis for every task as well as sufficient references and extensive discussions.
Furthermore, although the substantive content of this book focuses on
human-centric visual analysis, it is also enlightening regarding the development of
detection, parsing, recognition, and high-level understanding methods for other AI
applications such as robotic perception. Additionally, some new advances in deep
learning are mentioned. For example, Liang introduces the Kalman normalization
method, which was invented by Liang and his students, for improving and accel-
erating the training of DNNs, particularly in the context of microbatches.
I believe this book will be very helpful and important to academic
professors/students as well as industrial engineers working in the field of vision
surveillance, biometrics, and human–computer interaction, where human-centric
visual analysis is indispensable in analyzing human identity, pose, attributes, and
behaviors. Briefly, this book will not only equip you with the skills to solve the
application problems but will also give you a front-row seat to the development of
artificial intelligence. Enjoy!
Alan Yuille
Bloomberg Distinguished Professor of Cognitive Science
and Computer Science
Johns Hopkins University, Baltimore, Maryland, USA
Contents
6.5 Experiments
  6.5.1 Experimental Settings
  6.5.2 PASCAL-Person-Part Dataset
  6.5.3 CIHP Dataset
  6.5.4 Qualitative Results
References
7 Video Instance-Level Human Parsing
  7.1 Introduction
  7.2 Video Instance-Level Parsing Dataset
    7.2.1 Data Amount and Quality
    7.2.2 Dataset Statistics
  7.3 Adaptive Temporal Encoding Network
    7.3.1 Flow-Guided Feature Propagation
    7.3.2 Parsing R-CNN
    7.3.3 Training and Inference
References
Abstract The past decade has witnessed the rapid development of feature representation learning, especially deep learning. Deep learning methods have achieved great success in many applications, including computer vision and natural language processing. In this chapter, we present a short review of the foundation of deep learning, i.e., the artificial neural network, and introduce some new techniques in deep learning.
Neural networks, the foundation of deep learning models, are biologically inspired
systems that are intended to simulate the way in which the human brain processes
information. The human brain consists of a large number of neurons that are highly
connected by synapses. The arrangement of neurons and the strengths of the indi-
vidual synapses, determined by a complex chemical process, establish the function
of the neural network of the human brain. Neural networks are excellent tools for
finding patterns that are far too complex or numerous for a human programmer to
extract and teach the machine to recognize.
The beginning of neural networks can be traced to the 1940s, when the single
perceptron neuron was proposed, and only over the past several decades have neural
networks become a major part of artificial intelligence. This is due to the development
of backpropagation, which allows multilayer perceptron networks to adjust the
weights of neurons whenever the output does not match the desired result. In the
following, we briefly review the background of neural
networks, including the perceptron, multilayer perceptron, and the backpropagation
algorithm.
1.1.1 Perceptron
introduce weights w_j to each input to account for the difference. The perceptron sums
the weighted inputs and produces a single binary output with its activation function,
f(x), which is defined as

f(x) = \begin{cases} 1, & \text{if } \sum_j w_j x_j + b > 0, \\ 0, & \text{otherwise,} \end{cases} \qquad (1.1)

where w_j is the weight of the jth input and b is the bias, which shifts the decision
boundary away from the origin.
The perceptron with one output can only be used for binary classification prob-
lems. As with most other techniques for training linear classifiers, the perceptron
naturally generalizes to multiclass classification. Here, the input x and the output y
are drawn from arbitrary sets. A feature representation function f (x, y) maps each
possible input/output pair to a finite-dimensional real-valued feature vector. The
feature vector is multiplied by a weight vector w, and the resulting score is used to
choose among the possible outputs: \hat{y} = \arg\max_y w \cdot f(x, y).
Perceptron neurons are a type of linear classifier. If the dataset is linearly separable,
then the perceptron network is guaranteed to converge. Furthermore, there is an upper
bound on the number of times that the perceptron will adjust its weights during the
training. Suppose that the input vectors from the two classes can be separated by a
hyperplane with a margin γ, and let R denote the maximum norm of an input vector.
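The learning rule described above can be made concrete. The following is a minimal illustrative sketch, not the book's own code, training a perceptron on a linearly separable toy problem (an AND gate):

```python
# Minimal perceptron with the learning rule: weights are adjusted
# only when the prediction is wrong; convergence is guaranteed for
# linearly separable data (here, an AND gate).
def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(data, lr=0.1, epochs=100):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in data:
            err = y - predict(w, b, x)
            if err != 0:
                errors += 1
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
        if errors == 0:  # no mistakes in a full pass: converged
            break
    return w, b

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
print([predict(w, b, x) for x, _ in data])  # [0, 0, 0, 1]
```

The update moves the decision boundary only on misclassified samples, which is exactly the mistake-bound setting of the convergence guarantee above.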
Fig. 1.1 A multilayer perceptron with an input layer, two hidden layers, and an output layer
f(x) = \frac{1}{1 + \exp\{-(w \cdot x + b)\}}, \qquad (1.3)
where x is the input vector and w is the weight vector. With the sigmoid function, the
output of the neuron is no longer just the binary value 1 or 0. In general, the sigmoid
function is real-valued, monotonic, smooth, and differentiable, having a nonnegative
first derivative that is bell shaped. The smoothness of the sigmoid function means
that small changes \Delta w_j in the weights and \Delta b in the bias will produce a small change
\Delta output, which is well approximated by

\Delta \text{output} \approx \sum_j \frac{\partial\, \text{output}}{\partial w_j} \Delta w_j + \frac{\partial\, \text{output}}{\partial b} \Delta b, \qquad (1.4)
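This linear approximation can be checked numerically. A small sketch with illustrative weight values (not taken from the text):

```python
import math

# Sigmoid neuron output for weights w, bias b, input x.
def output(w, b, x):
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

w, b, x = [0.5, -0.3], 0.1, [1.0, 2.0]
y0 = output(w, b, x)

# Perturb w_0 by a small dw; for the sigmoid, the output change should
# be close to (d output / d w_0) * dw = y0 * (1 - y0) * x_0 * dw.
dw = 1e-4
y1 = output([w[0] + dw, w[1]], b, x)
approx = y0 * (1 - y0) * x[0] * dw
print(abs((y1 - y0) - approx))  # far smaller than dw itself
```

The residual shrinks quadratically in dw, which is what makes gradient-based weight adjustment well behaved for smooth activations.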
Fig. 1.2 Left: illustration of the "XOR" problem. Right: a solution of the "XOR" problem with a two-layer perceptron network
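The two-layer solution of the XOR problem can be written out directly. The weights below are illustrative hand-chosen values (Fig. 1.2 uses its own), with the hidden units acting as OR and NAND gates:

```python
# A minimal two-layer perceptron that solves XOR, which no single
# perceptron can: the hidden layer computes OR and NAND, and the
# output unit is the AND of those two.
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)    # OR gate
    h2 = step(-1.0 * x1 - 1.0 * x2 + 1.5)   # NAND gate
    return step(1.0 * h1 + 1.0 * h2 - 1.5)  # AND of the two

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))  # 0, 1, 1, 0
```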
To conveniently describe the neural network, we use the following parameter settings.
Let n_l denote the total number of layers in the neural network, and let L_l denote the
lth layer. Thus, L_1 and L_{n_l} are the input layer and the output layer, respectively.
We use (W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, \ldots) to denote the parameters of the neural
network, where W^{(l)}_{ij} denotes the parameter of the connection between unit j in layer
l and unit i in layer l + 1. Additionally, b^{(l)}_i is the bias associated with unit i in layer
l + 1. Thus, in this case, W^{(1)} \in \mathbb{R}^{3\times3} and W^{(2)} \in \mathbb{R}^{1\times3}. We use a^{(l)}_i to denote the
activation of unit i in layer l. Given a fixed setting of the parameters (W, b), the
neural network defines a hypothesis h_{W,b}(x). Specifically, the computation that
this neural network represents is given by

a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1),
a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2),
a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3), \qquad (1.5)
h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1).
Let z^{(l)}_i denote the total weighted sum of inputs to unit i in layer l, including
the bias term (e.g., z^{(2)}_i = \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i), such that a^{(l)}_i = f(z^{(l)}_i). If we
extend the activation function f(\cdot) to apply to vectors in an elementwise fashion,
as f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)], then we can write the above equations
more compactly as

z^{(2)} = W^{(1)} x + b^{(1)},
a^{(2)} = f(z^{(2)}),
z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}, \qquad (1.6)
h_{W,b}(x) = a^{(3)} = f(z^{(3)}).

We call this step forward propagation. More generally, recalling that we also use a^{(1)} = x
to denote the values from the input layer, then given layer l's activations a^{(l)},
we can compute layer (l + 1)'s activations a^{(l+1)} as

z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)},
a^{(l+1)} = f(z^{(l+1)}).
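Forward propagation for this small 3-3-1 network can be sketched as follows; the weight values are illustrative placeholders, not from the text:

```python
import math

# Forward propagation for a 3-3-1 network with sigmoid activation:
# z2 = W1 x + b1, a2 = f(z2), z3 = W2 a2 + b2, h(x) = f(z3).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, b1, W2, b2, x):
    z2 = [sum(Wi[j] * x[j] for j in range(len(x))) + bi
          for Wi, bi in zip(W1, b1)]
    a2 = [sigmoid(z) for z in z2]
    z3 = [sum(Wi[j] * a2[j] for j in range(len(a2))) + bi
          for Wi, bi in zip(W2, b2)]
    return [sigmoid(z) for z in z3]

W1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]  # W1 in R^{3x3}
b1 = [0.0, 0.0, 0.0]
W2 = [[0.1, 0.2, 0.3]]                                    # W2 in R^{1x3}
b2 = [0.0]
out = forward(W1, b1, W2, b2, [1.0, 2.0, 3.0])
print(out)  # a single activation in (0, 1)
```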
Compared with the traditional MLP, the new neural networks are generally deeper,
and it is more difficult to optimize these neural networks by backpropagation. Thus,
many new techniques have been proposed to stabilize network training, such as
batch normalization (BN) and batch Kalman normalization [1].
where the expectation and variance are computed over the training dataset.
Then, for each activation x^{(k)}, a pair of parameters \gamma^{(k)}, \beta^{(k)} are introduced to
scale and shift the normalized value as

y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}.

In practice, the statistics are estimated over a minibatch B = \{x_{1 \ldots m}\}:

\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad
\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}. \qquad (1.10)

Let the normalized values be \hat{x}_{1 \ldots m}, and let their linear transformations be y_{1 \ldots m}. We
refer to the transform

BN_{\gamma,\beta}: x_{1 \ldots m} \to y_{1 \ldots m} \qquad (1.11)
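The BN transform can be sketched for a minibatch of scalar activations; the γ, β, and ε values below are illustrative defaults:

```python
import math

# Batch normalization over a minibatch of scalar activations:
# mu_B and sigma_B^2 are computed from the batch, each value is
# normalized, then scaled by gamma and shifted by beta.
def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

ys = batch_norm([1.0, 2.0, 3.0, 4.0])
print(ys)  # roughly zero mean and unit variance
```

Note that the estimates μ_B and σ_B² come from the current minibatch alone, which is exactly the limitation BKN addresses below.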
Fig. 1.3 (a) illustrates the distribution estimation in conventional batch normalization (BN),
where the minibatch statistics, \mu_k and \Sigma_k, are estimated based on the currently observed minibatch
at the kth layer. For clarity of notation, \mu_k and \Sigma_k indicate the mean and the covariance matrix,
respectively. Note that only the diagonal entries are used in normalization. X and \hat{X} represent the
internal representation before and after normalization. In (b), batch Kalman normalization (BKN)
provides a more accurate distribution estimation of the kth layer by aggregating the statistics of the
preceding (k−1)th layer
Fig. 1.4 Illustration of the proposed batch Kalman normalization (BKN). At the (k−1)th layer
of a DNN, BKN first estimates its statistics (means and covariances), \hat{\mu}_{k-1|k-1} and \hat{\Sigma}_{k-1|k-1}.
Additionally, the estimations in the kth layer are based on the estimations of the (k−1)th layer,
where these estimations are updated by combining with the observed statistics of the kth layer. This
process treats the entire DNN as a whole system, in contrast to existing works that estimate the
statistics of each hidden layer independently
variances) with a maximum number of 2048 dimensions into a new state vector and
then combining with the current observations (Fig. 1.4).
Let x^k be the feature vector of a hidden neuron in the kth hidden layer of a DNN, such
as a pixel in a hidden convolutional layer of a CNN. BN normalizes the values of x^k
by using a minibatch of m samples, B = \{x^k_1, x^k_2, \ldots, x^k_m\}. The mean and covariance
of x^k are approximated by

S^k \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x^k_i - \bar{x}^k)(x^k_i - \bar{x}^k)^T \qquad (1.12)
and

\bar{x}^k \leftarrow \frac{1}{m} \sum_{i=1}^{m} x^k_i. \qquad (1.13)

We have \hat{x}^k \leftarrow \frac{x^k_i - \bar{x}^k}{\sqrt{\mathrm{diag}(S^k)}}, where \mathrm{diag}(\cdot) denotes the diagonal entries of a matrix,
i.e., the variances of x^k. Then, the normalized representation is scaled and shifted
to preserve the modeling capacity of the network, y^k \leftarrow \gamma \hat{x}^k + \beta, where \gamma and \beta
are parameters that are optimized during training. However, a minibatch with a
are parameters that are optimized during training. However, a minibatch with a
moderately large size is required to estimate the statistics in BN. It is compelling to
explore better estimations of the distribution in a DNN to accelerate training. Assume
that the true values of the hidden neurons in the kth layer can be represented by the
variable x k , which is approximated by using the values in the previous layer x k−1 .
We have
x k = Ak x k−1 + u k , (1.14)
where Ak is a state transition matrix that transforms the states (features) in the
previous layer to the current layer. Additionally, u k is a bias that follows a Gaussian
distribution with zero mean and unit variance. Note that Ak could be a linear transition
between layers. This is reasonable because our purpose is not to accurately compute
the hidden features in a certain layer given those in the previous layer but rather to
draw a connection between layers to estimate the statistics.
As the above true values of x^k exist but are not directly accessible, they can be
measured by the observation z^k with a bias term v^k:

z^k = x^k + v^k, \qquad (1.15)

where z^k indicates the observed values of the features in a minibatch. In other words,
to estimate the statistics of x^k, previous studies only consider the observed values
z^k in a minibatch. BKN also takes the features in the previous layer into account.
To this end, we compute the expectation on both sides of Eq. (1.14), i.e., E[x^k] =
E[A^k x^{k-1} + u^k], and have

\hat{\mu}_{k|k-1} = A^k \hat{\mu}_{k-1|k-1}, \qquad (1.16)

where \hat{\mu}_{k-1|k-1} denotes the estimation of the mean in the (k−1)th layer, and \hat{\mu}_{k|k-1}
is the estimation of the mean in the kth layer conditioned on the previous layer. We
call \hat{\mu}_{k|k-1} an intermediate estimation of layer k because it is then combined
with the observed values to achieve the final estimation. As shown in Eq. (1.17),
the estimation in the current layer \hat{\mu}_{k|k} is computed by combining the intermediate
estimation with a bias term, which represents the error between the observed values
\bar{z}^k and \hat{\mu}_{k|k-1}:

\hat{\mu}_{k|k} = \hat{\mu}_{k|k-1} + q^k (\bar{z}^k - \hat{\mu}_{k|k-1}). \qquad (1.17)

Here, \bar{z}^k indicates the observed mean values, i.e., \bar{z}^k = \bar{x}^k.
Additionally, q^k is a gain value indicating how much we rely on this bias.
The covariance estimations are updated analogously:

\hat{\Sigma}_{k|k-1} = A^k \hat{\Sigma}_{k-1|k-1} (A^k)^T + R, \qquad
\hat{\Sigma}_{k|k} = \hat{\Sigma}_{k|k-1} + q^k (S^k - \hat{\Sigma}_{k|k-1}), \qquad (1.18)

where \hat{\Sigma}_{k|k-1} and \hat{\Sigma}_{k|k} denote the intermediate and the final estimations of the covariance
matrices in the kth layer, respectively. R is the covariance matrix of the bias u^k
in Eq. (1.14); note that it is identical for all the layers. S^k are the observed covariance
matrices of the minibatch in the kth layer. In Eq. (1.18), the transition matrix A^k, the
covariance matrix R, and the gain value q^k are parameters that are optimized during
training. In BKN, we employ \hat{\mu}_{k|k} and \hat{\Sigma}_{k|k} to normalize the hidden representations.
Please refer to [2] for the details of batch Kalman normalization.
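The gain-weighted mean update of BKN can be illustrated for scalar statistics. The scalar transition `a_k` and gain `q_k` below are illustrative stand-ins for the learned matrix A^k and gain value q^k, not the authors' implementation:

```python
# Sketch of the BKN mean estimation for a scalar statistic:
# Eq. (1.16) forms an intermediate estimate from the previous layer's
# mean, and Eq. (1.17) blends it with the observed minibatch mean
# using the gain q_k.
def bkn_mean_update(mu_prev, a_k, z_bar_k, q_k):
    mu_inter = a_k * mu_prev                      # intermediate estimate
    return mu_inter + q_k * (z_bar_k - mu_inter)  # blend with observation

print(bkn_mean_update(1.0, 1.0, 2.0, 0.5))  # 1.5
```

With q_k = 1 the update falls back to the BN behavior of trusting only the current minibatch; smaller gains lean more on the propagated estimate.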
As one key step toward many subsequent face-related applications, face detection
has been extensively studied in the computer vision literature. Early efforts in face
detection date back to as early as the beginning of the 1970s, where simple heuristic
and anthropometric techniques [1] were used. Prior to 2000, despite progress [2, 3],
the practical performance of face detection was far from satisfactory. One genuine
breakthrough was the Viola-Jones framework [4], which applied rectangular Haar-
like features in a cascaded AdaBoost classifier to achieve real-time face detection.
However, this framework has several critical drawbacks. First, its feature size was
relatively large. Typically, in a 24 × 24 detection window, the number of Haar-
like features was 160 thousand [5]. Second, this framework is not able to effectively
handle non-frontal faces in the wild. Many works have been proposed to address these
issues of the Viola-Jones framework and achieve further improvements. First, more
complicated features (such as HOG [6], SIFT [7], SURF [8]) were used. For example,
Liao et al. [9] proposed a new image feature called normalized pixel difference
(NPD), which is computed as the difference to sum ratio between two pixel values.
Second, to detect faces with various poses, some works combined multiple detectors,
each of which was trained for a specific view. As a representative work, Zhu et al.
[10] applied multiple deformable part models to capture faces with different views
and expressions.
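The NPD feature mentioned above has a one-line definition; a minimal sketch (the convention f(0, 0) = 0 follows the NPD work):

```python
# Normalized pixel difference (NPD) between two pixel intensities:
# f(p, q) = (p - q) / (p + q), with f(0, 0) defined as 0.
def npd(p, q):
    if p == 0 and q == 0:
        return 0.0
    return (p - q) / (p + q)

print(npd(30, 10))  # 0.5
```

The difference-to-sum ratio is bounded in [−1, 1] and invariant to multiplying both intensities by the same factor, which is what makes it robust to illumination changes.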
Recent years have witnessed advances in face detection using deep learning meth-
ods, which significantly outperform traditional computer vision methods. For exam-
ple, Li et al. [11] proposed a cascade architecture built on CNNs, which can quickly
reject the background regions in the fast low-resolution stage and effectively calibrate
the bounding boxes of face proposals in the high-resolution stages. Following a similar
Facial landmark localization has long been attempted in computer vision, and a
large number of approaches have been proposed for this purpose. The conventional
approaches for this task can be divided into two categories: template fitting methods
and regression-based methods.
Template fitting methods build face templates to fit the input face appearance. A
representative work is the active appearance model (AAM) [19], which attempts
to estimate model parameters by minimizing the residuals between the holistic
appearance and an appearance model. Rather than using holistic representations,
a constrained local model (CLM) [20] learns an independent local detector for each
facial keypoint and a shape model for capturing valid facial deformations. Improved
versions of CLM primarily differ from each other in terms of local detectors. For
instance, Belhumeur et al. [21] detected facial landmarks by employing SIFT features
and SVM classifiers, and Liang et al. [22] applied AdaBoost to Haar wavelet
features. These methods are generally superior to the holistic methods due
to the robustness of patch detectors against illumination variations and occlusions.
Regression-based facial landmark localization methods can be further divided
into direct mapping techniques and cascaded regression models. The former directly
maps local or global facial appearances to landmark locations. For example, Dantone
et al. [23] estimated the absolute coordinates of facial landmarks directly from an
ensemble of conditional regression trees trained on facial appearances. Valstar et al.
[24] applied boosted regression to map the appearances of local image patches to the
positions of corresponding facial landmarks. Cascaded regression models [25–31]
formulate shape estimation as a regression problem and make predictions in a cascaded
manner. These models typically start from an initial face shape and iteratively
refine the shape according to learned regressors, which map local appearance fea-
tures to incremental shape adjustments until convergence is achieved. Cao et al. [25]
trained a cascaded nonlinear regression model to infer an entire facial shape from
an input image using pairwise pixel-difference features. Burgos–Artizzu et al. [32]
proposed a novel cascaded regression model for estimating both landmark positions
and their occlusions using robust shape-indexed features. Another seminal method is
the supervised descent method (SDM) [27], which uses SIFT features extracted from
around the current shape and minimizes a nonlinear least-squares objective using the
learned descent directions. All these methods assume that an initial shape is given in
some form, e.g., a mean shape [27, 28]. However, this assumption is too strict and
may lead to poor performance on faces with large pose variations.
INRIA [55] was released in 2005, containing 1805 images of humans cropped from
a varied set of personal photos. ETH [56] was collected during strolls through busy
shopping streets. Daimler [57] contains pedestrians that are fully visible in an upright
position. TUD [58] was developed for many tasks, including pedestrian detection.
Positive samples of the training set were collected in a busy pedestrian zone with
a handheld camera and include not only upright standing pedestrians but also
side-standing ones. Negative samples of the training set were collected in an inner city
district and also from vehicle driving videos. The test set was collected in the inner
city of Brussels from a driving car, and all pedestrians are annotated. KITTI [59] was
collected by four high-resolution video cameras, and up to 15 cars and 30 pedestrians
are visible per image. Caltech [60] is the largest pedestrian dataset to date, collected
from 10 h of vehicle driving video in an urban scenario. This dataset includes pedestrians
at different scales and positions, and various degrees of occlusion are also included.
The existing methods can be divided into two categories: handcrafted features
followed by a classical classifier, and deep learning methods.
Early approaches typically consist of two separate stages: feature extraction and
binary classification. Candidate bounding boxes are generated by sliding-window
methods. Classic HOG [55] proposed using histogram of oriented gradients as fea-
tures and a linear support vector machine as the classifier. Following this framework,
various feature descriptors and classifiers were proposed. Typical classifiers include
nonlinear SVM and AdaBoost. HIKSVM [61] proposed using histogram intersec-
tion kernel SVM, which is a nonlinear SVM. RandForest [62] used a random forest
ensemble, rather than SVM, as the classifier. For various feature descriptors, ICF
[63] generalized several basic features to multiple channel features by computations
of linear filters, nonlinear transformations, pointwise transformations, integral his-
togram, and gradient histogram. Integral images are used to obtain the final features.
Features are learned by the boosting algorithm, while decision trees are employed as
the weak classifiers. SCF [64] inherited the main idea of ICF but proposed an insightful
revision. Rather than using regular cells as the classic HOG method does, SCF
attempts to learn an irregular pattern of cells. The feature pool consists of squares
in detection windows. ACF [65] attempted to accelerate pyramid feature learning
through the aggregation of channel features. Additionally, it learns by AdaBoost [66],
whose base classifiers are deep trees. LDCF [67] proposed a local decorrelation
transformation. SpatialPooling [68] was built based on ACF [65]. Spatial pooling is used
to compute the covariance descriptor and local binary pattern descriptor, enhancing
the robustness to noise and transformation. Features are learned by structural SVM.
[69] explored several types of filters, and a checkerboard filter achieved the best per-
formance. Deformable part models (DPMs) have been widely used for solving the
occlusion issue. [70] first proposed deformable parts filters, which are placed near
the bottom level of the HOG feature pyramid. A multiresolution model was proposed
by [71] as a DPM. [72] used DPM for multi-pedestrian detection and proved that
DPM can be flexibly incorporated with other descriptors such as HOG. [73] designed
a multitask form of DPM that captures the similarities and differences of samples.
DBN-Isol [74] proposed a discriminative deep model for learning the correlations of
deformable parts. In [75], a parts model was embedded into a designed deep model.
Sermanet et al. [76] first used a deep convolutional architecture. Reference [76]
designed a multiscale convolutional network composed of two stages of convolutional
layers for feature extraction, which is followed by a classifier. The model is
first trained with unsupervised learning layer by layer and then with supervised
learning using a classifier for label prediction. Unlike previous approaches, this
convolutional network performs end-to-end training, with all features learned from
the input data. Moreover, bootstrapping is used to relieve the imbalance between
positive and negative samples. JointDeep [77] designed a deep convolutional network
in which each convolutional layer is responsible for a specific task, while the whole
network is able to learn feature extraction, deformation handling, occlusion handling,
and classification jointly. MultiSDP [78] proposed a multistage contextual deep model
simulating cascaded classifiers. Rather than being trained sequentially, the cascaded
classifiers in the deep model can be trained jointly using backpropagation. SDN [79]
proposed a switchable restricted Boltzmann machine for better detection in cluttered
backgrounds and of pedestrians with varied appearances.
Driven by the success of (“slow”) R-CNN [80] for general object detection, a
recent series of methods have adopted a two-stage pipeline for pedestrian detection.
These methods first use proposal methods to predict candidate detection bounding
boxes, generally in large numbers. These candidate boxes are then fed into a CNN for
feature learning and class prediction. In the task of pedestrian detection, the proposal
methods used are generally standalone pedestrian detectors consisting of handcrafted
features and boosted classifiers.
Reference [81] used SquaresChnFtrs [64] as the proposal method, feeding the
resulting proposals into a CNN for classification. In this work, two CNNs of different
scales were tried, namely, CifarNet [82] and AlexNet [83]. The methods were evaluated
on the Caltech [60] and KITTI [59] datasets. The performance was on par with the state
of the art at that time but could not yet surpass some of the handcrafted methods
due to the design of the CNN and the lack of part or occlusion modeling.
TA-CNN [84] employed the ACF detector [65], incorporating semantic information,
to generate proposals. The CNN used was revised from AlexNet [83]. This
method attempted to improve the model effects by relieving the confusion between
positive samples and hard negative ones. The method was evaluated on the Caltech
[60] and ETH [56] datasets, and it surpassed state-of-the-art methods.
DeepParts [85] applied the LDCF [67] detector to generate proposals and learned
a set of complementary parts by neural networks, improving occlusion detection.
They first constructed a part pool covering all positions and ratios of body parts,
and they automatically chose appropriate parts for part detection. Subsequently, the
model learned a part detector for each body part without using part annotations. These
part detectors are independent CNN classifiers, one for each body part. Furthermore,
proposal shifting problems were handled. Finally, full-body scores were inferred,
and pedestrian detection was fulfilled.
SAF R-CNN [86] implemented an intuitive revision of this R-CNN two-stage
approach. They used the ACF detector [65] for proposal generation. The proposals
were fed into a CNN and then separated into two subnetwork branches, driven
by a scale-aware weighting layer. Each of the subnetworks is a Fast
R-CNN [15] framework. This approach improved small-size pedestrian detection.
Unlike the above R-CNN-based methods, the CompACT method [87] obtained
both handcrafted features and deep convolutional features and learned boosted
classifiers on top of them. A complexity-aware cascade boosting algorithm was
used such that features of various complexities could be integrated into a
single model.
The CCF detector [88] is a boosted classifier on pyramids of deep convolutional
features, but it uses no region proposals. Rather than using a deep convolutional
network as both feature learner and predictor, as the aforementioned methods do,
this method utilized the deep convolutional network only as a first-step image
feature extractor.
The goal of human parsing is to partition the human body into different semantic
parts, such as hair, head, torso, arms, legs, and so forth, which provides rich descrip-
tions for human-centric analysis and thus becomes increasingly important for many
computer vision applications, including content-based image/video retrieval, person
re-identification, video surveillance, action recognition, and clothing fashion
recognition. However, it is very challenging in real-life scenarios due to the variability in
human appearances and shapes caused by the large numbers of human poses, clothes
types, and occlusion/self-occlusion patterns.
Part segment proposal generation. Previous works generally adopt low-level
segment-based proposals, while some approaches exploit higher level cues. Bo
and Fowlkes exploited roughly learned part location priors and part mean shape
information, and they derived a number of part segments from the gPb-UCM method
using a constrained region merging method. Dong et al. employed Parselets to obtain
mid-level part semantic information for the proposals. However, low-level, mid-level,
or rough location proposals may all result in many false positives, misleading the
later process.
References
7. P.C. Ng, S. Henikoff, Sift: Predicting amino acid changes that affect protein function. Nucleic
acids research 31(13), 3812–3814 (2003)
8. Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, J. R. Smith, Learning locally-adaptive decision
functions for person verification, in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 3610–3617 (2013)
9. S. Liao, A.K. Jain, S.Z. Li, A fast and accurate unconstrained face detector. IEEE transactions
on pattern analysis and machine intelligence 38(2), 211–223 (2016)
10. X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild,
in CVPR. IEEE, pp. 2879–2886 (2012)
11. H. Li, Z. Lin, X. Shen, J. Brandt, G. Hua, A convolutional neural network cascade for face
detection, in CVPR, pp. 5325–5334 (2015)
12. K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded
convolutional networks. IEEE Signal Processing Letters 23(10), 1499–1503 (2016)
13. Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, X. Hu, Scale-aware face detection, in CVPR, vol. 3 (2017)
14. Y. Liu, H. Li, J. Yan, F. Wei, X. Wang, X. Tang, Recurrent scale approximation for object
detection in cnn, in ICCV, vol. 5 (2017)
15. R. Girshick, Fast r-cnn, in Proceedings of the IEEE International Conference on Computer
Vision, pp. 1440–1448 (2015)
16. S. Wan, Z. Chen, T. Zhang, B. Zhang, K.-k. Wong, Bootstrapping face detection with hard
negative examples, arXiv preprint arXiv:1608.02236 (2016)
17. V. Jain, E. Learned-Miller, Fddb: a benchmark for face detection in unconstrained settings,
Technical Report UM-CS-2010-009, University of Massachusetts, Amherst (Tech, Rep, 2010)
18. Y. Bai, Y. Zhang, M. Ding, B. Ghanem, Finding tiny faces in the wild with generative adversarial network, in CVPR (2018)
19. T.F. Cootes, G.J. Edwards, C.J. Taylor, Active appearance models. PAMI 6, 681–685 (2001)
20. J.M. Saragih, S. Lucey, J.F. Cohn, Deformable model fitting by regularized landmark mean-
shift. IJCV 91(2), 200–215 (2011)
21. P.N. Belhumeur, D.W. Jacobs, D.J. Kriegman, N. Kumar, Localizing parts of faces using a
consensus of exemplars. PAMI 35(12), 2930–2940 (2013)
22. L. Liang, R. Xiao, F. Wen, J. Sun, Face alignment via component-based discriminative search,
in ECCV (Springer, 2008), pp. 72–85
23. M. Dantone, J. Gall, G. Fanelli, L. Van Gool, Real-time facial feature detection using conditional
regression forests, in CVPR (IEEE, 2012), pp. 2578–2585
24. M. Valstar, B. Martinez, X. Binefa, M. Pantic, Facial point detection using boosted regression
and graph models, in CVPR (IEEE, 2010), pp. 2729–2736
25. X. Cao, Y. Wei, F. Wen, J. Sun, Face alignment by explicit shape regression. IJCV 107(2),
177–190 (2014)
26. V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees,
in CVPR, pp. 1867–1874 (2014)
27. X. Xiong, F. Torre, Supervised descent method and its applications to face alignment, in CVPR,
pp. 532–539 (2013)
28. S. Ren, X. Cao, Y. Wei, J. Sun, Face alignment at 3000 fps via regressing local binary features,
in CVPR, pp. 1685–1692 (2014)
29. S. Zhu, C. Li, C.-C. Loy, X. Tang, Unconstrained face alignment via cascaded compositional
learning, in CVPR, pp. 3409–3417 (2016)
30. O. Tuzel, T. K. Marks, S. Tambe, Robust face alignment using a mixture of invariant experts,
in ECCV (Springer, 2016), pp. 825–841
31. X. Fan, R. Liu, Z. Luo, Y. Li, Y. Feng, Explicit shape regression with characteristic number for
facial landmark localization, TMM (2017)
32. X. Burgos-Artizzu, P. Perona, P. Dollár, Robust face landmark estimation under occlusion, in
ICCV, pp. 1513–1520 (2013)
33. E. Zhou, H. Fan, Z. Cao, Y. Jiang, Q. Yin, Extensive facial landmark localization with coarse-
to-fine convolutional network cascade, in ICCV Workshops, pp. 386–391 (2013)
34. Z. Zhang, P. Luo, C.C. Loy, X. Tang, Facial landmark detection by deep multi-task learning,
in ECCV (Springer, 2014), pp. 94–108
35. H. Liu, D. Kong, S. Wang, B. Yin, Sparse pose regression via componentwise clustering feature
point representation. TMM 18(7), 1233–1244 (2016)
36. T. Zhang, W. Zheng, Z. Cui, Y. Zong, J. Yan, K. Yan, A deep neural network-driven feature
learning method for multi-view facial expression recognition. TMM 18(12), 2528–2536 (2016)
37. J. Zhang, S. Shan, M. Kan, X. Chen, Coarse-to-fine auto-encoder networks (cfan) for real-time
face alignment, in ECCV (Springer, 2014), pp. 1–16
38. J. Zhang, M. Kan, S. Shan, X. Chen, Occlusion-free face alignment: deep regression networks
coupled with de-corrupt autoencoders, in CVPR, pp. 3428–3437 (2016)
39. H. Lai, S. Xiao, Z. Cui, Y. Pan, C. Xu, S. Yan, Deep cascaded regression for face alignment,
arXiv preprint arXiv:1510.09083 (2015)
40. D. Merget, M. Rock, G. Rigoll, Robust facial landmark detection via a fully-convolutional
local-global context network, in CVPR, pp. 781–790 (2018)
41. A. Bulat and G. Tzimiropoulos, Super-fan: Integrated facial landmark localization and super-
resolution of real-world low resolution faces in arbitrary poses with gans, in CVPR (2018)
42. Z. Tang, X. Peng, S. Geng, L. Wu, S. Zhang, D. Metaxas, Quantized densely connected u-nets
for efficient landmark localization, in ECCV (2018)
43. X. Peng, R.S. Feris, X. Wang, D.N. Metaxas, A recurrent encoder-decoder network for sequen-
tial face alignment, in ECCV (Springer, 2016), pp. 38–56
44. S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, A. Kassim, Robust facial landmark detection via
recurrent attentive-refinement networks, in ECCV (Springer, 2016), pp. 57–72
45. G. Trigeorgis, P. Snape, M.A. Nicolaou, E. Antonakos, S. Zafeiriou, Mnemonic descent method:
a recurrent process applied for end-to-end face alignment, in CVPR, pp. 4177–4187 (2016)
46. X. Zhu, Z. Lei, X. Liu, H. Shi, S. Z. Li, Face alignment across large poses: a 3d solution, in
CVPR, pp. 146–155 (2016)
47. A. Jourabloo, X. Liu, Large-pose face alignment via cnn-based dense 3d model fitting, in
CVPR, pp. 4188–4196 (2016)
48. F. Liu, D. Zeng, Q. Zhao, X. Liu, Joint face alignment and 3d face reconstruction, in ECCV
(Springer, 2016), pp. 545–560
49. A. Bulat, G. Tzimiropoulos, How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks), in CVPR, vol. 1, no. 2, p. 4 (2017)
50. Y. Feng, F. Wu, X. Shao, Y. Wang, X. Zhou, Joint 3d face reconstruction and dense alignment
with position map regression network, in ECCV (2018)
51. X. Dong, S.-I. Yu, X. Weng, S.-E. Wei, Y. Yang, Y. Sheikh, Supervision-by-registration: an
unsupervised approach to improve the precision of facial landmark detectors, in CVPR, pp.
360–368 (2018)
52. Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, H. Lee, Unsupervised discovery of object landmarks
as structural representations, in CVPR (2018)
53. X. Dong, Y. Yan, W. Ouyang, Y. Yang, Style aggregated network for facial landmark detection,
in CVPR, vol. 2, p. 6 (2018)
54. S. Honari, P. Molchanov, S. Tyree, P. Vincent, C. Pal, J. Kautz, Improving landmark localization
with semi-supervised learning, in CVPR (2018)
55. N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) (2005)
56. A. Ess, B. Leibe, L. Van Gool, Depth and appearance for mobile scene analysis, in IEEE International Conference on Computer Vision (ICCV) (2007)
57. M. Enzweiler, D.M. Gavrila, Monocular pedestrian detection: Survey and experiments. IEEE
Trans. Pattern Anal. Mach. Intell. 12, 2179–2195 (2008)
58. C. Wojek, S. Walk, B. Schiele, Multi-cue onboard pedestrian detection (2009)
59. A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? the kitti vision benchmark
suite, in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (IEEE,
2012), pp. 3354–3361
60. P. Dollár, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: an evaluation of the state of the art (2012)
61. S. Maji, A.C. Berg, J. Malik, Classification using intersection kernel support vector machines
is efficient, in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference
on, pp. 1–8. IEEE (2008)
62. J. Marin, D. Vázquez, A.M. López, J. Amores, B. Leibe, Random forests of local experts for
pedestrian detection, in Proceedings of the IEEE International Conference on Computer Vision,
pp. 2592–2599 (2013)
63. P. Dollár, Z. Tu, P. Perona, S. Belongie, Integral channel features, in British Machine Vision Conference (BMVC) (2009)
64. R. Benenson, M. Mathias, T. Tuytelaars, L. Van Gool, Seeking the strongest rigid detector,
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
3666–3673 (2013)
65. P. Dollár, R. Appel, S. Belongie, P. Perona, Fast feature pyramids for object detection (2014)
66. J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, in The Annals of Statistics (2000)
67. W. Nam, P. Dollár, J.H. Han, Local decorrelation for improved pedestrian detection, in Advances
in Neural Information Processing Systems, pp. 424–432 (2014)
68. S. Paisitkriangkrai, C. Shen, A. Van Den Hengel, Strengthening the effectiveness of pedestrian
detection with spatially pooled features, in European Conference on Computer Vision (Springer,
2014), pp. 546–561
69. S. Zhang, R. Benenson, B. Schiele, et al., Filtered channel features for pedestrian detection, in
CVPR, volume 1, p. 4 (2015)
70. P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale,
deformable part model, in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE
Conference on (IEEE, 2008), pp. 1–8
71. D. Park, D. Ramanan, C. Fowlkes, Multiresolution models for object detection, in European
Conference on Computer Vision (Springer, 2010), pp. 241–254
72. W. Ouyang, X. Wang, Single-pedestrian detection aided by multi-pedestrian detection, in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3198–3205
(2013)
73. J. Yan, X. Zhang, Z. Lei, S. Liao, S.Z. Li, Robust multi-resolution pedestrian detection in traffic
scenes, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 3033–3040 (2013)
74. X. Wang, W. Ouyang, A discriminative deep model for pedestrian detection with occlusion
handling, in 2012 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2012),
pp. 3258–3265
75. W. Ouyang, X. Zeng, X. Wang, Modeling mutual visibility relationship in pedestrian detection,
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3222–
3229 (2013)
76. P. Sermanet, K. Kavukcuoglu, S. Chintala, Y. LeCun, Pedestrian detection with unsupervised
multi-stage feature learning, in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 3626–3633 (2013)
77. W. Ouyang, X. Wang, Joint deep learning for pedestrian detection, in Proceedings of the IEEE
International Conference on Computer Vision, pp. 2056–2063 (2013)
78. X. Zeng, W. Ouyang, X. Wang, Multi-stage contextual deep learning for pedestrian detection,
in Proceedings of the IEEE International Conference on Computer Vision, pp. 121–128 (2013)
79. P. Luo, Y. Tian, X. Wang, X. Tang, Switchable deep network for pedestrian detection, in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 899–
906 (2014)
80. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object
detection and semantic segmentation, in Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 580–587 (2014)
81. J. Hosang, M. Omran, R. Benenson, B. Schiele, Taking a deeper look at pedestrians, in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4073–4082
(2015)
82. A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images (Technical
report, Citeseer, 2009)
83. A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional
neural networks, in Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
84. X. Wang, Y. Tian, P. Luo, X. Tang, Pedestrian detection aided by deep learning semantic tasks,
in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
85. X. Wang, Y. Tian, P. Luo, X. Tang, Deep learning strong parts for pedestrian detection, in
IEEE International Conference on Computer Vision (ICCV) (2015)
86. J. Li, X. Liang, S. Shen, T. Xu, J. Feng, S. Yan, Scale-aware fast R-CNN for pedestrian detection. IEEE Transactions on Multimedia 20(4), 985–996 (2018)
87. M. Saberian, Z. Cai, N. Vasconcelos, Learning complexity-aware cascades for deep pedestrian
detection, in IEEE International Conference on Computer Vision (ICCV) (2015)
88. B. Yang, J. Yan, Z. Lei, S.Z. Li, Convolutional channel features, in ICCV, pp. 82–90 (2015)
Part II
Localizing Persons in Images
References
1. T. Chen, L. Lin, L. Liu, X. Luo, X. Li, Disc: deep image saliency computing via
progressive representation learning. TNNLS 27(6), 1135–1149 (2016)
2. L. Liu, H. Wang, G. Li, W. Ouyang, L. Lin, Crowd counting using deep recurrent
spatial-aware network, in IJCAI (2018)
3. L. Liu, R. Zhang, J. Peng, G. Li, B. Du, L. Lin, Attentive crowd flow machines,
in ACM MM (ACM, 2018), pp. 1553–1561
4. Z. Zhang, P. Luo, C.C. Loy, X. Tang, Facial landmark detection by deep multi-task
learning, in ECCV (Springer, 2014), pp. 94–108
5. Y. Sun, X. Wang, X. Tang, Deep convolutional network cascade for facial point
detection, in CVPR, pp. 3476–3483 (2013)
6. R. Weng, J. Lu, Y.-P. Tan, J. Zhou, Learning cascaded deep auto-encoder networks for face alignment. TMM 18(10), 2066–2078 (2016)
7. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic
segmentation, in CVPR, pp. 3431–3440 (2015)
8. P. Perona, P. Dollár, Z. Tu, S. Belongie, Integral channel features, in British
Machine Vision Conference (BMVC) (2009)
9. S. Belongie, P. Dollár, R. Appel, P. Perona, Fast feature pyramids for object
detection (2014)
10. A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep
convolutional neural networks, in Advances in Neural Information Processing
Systems, pp. 1097–1105 (2012)
11. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556 (2014)
12. J. Hosang, M. Omran, R. Benenson, B. Schiele, Taking a deeper look at pedes-
trians, in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 4073–4082 (2015)
13. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accu-
rate object detection and semantic segmentation, in Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 580–587 (2014)
Chapter 3
Face Localization and Enhancement
Abstract Facial landmark localization plays a critical role in facial recognition and
analysis. In this chapter, we first propose a novel cascaded backbone-branches fully
convolutional neural network (BB-FCN) for rapidly and accurately localizing facial
landmarks in unconstrained and cluttered settings. The proposed BB-FCN generates
facial landmark response maps directly from raw images without any preprocessing.
It follows a coarse-to-fine cascaded pipeline, which consists of a backbone network
for roughly detecting the locations of all facial landmarks and one branch network
for each type of detected landmark to further refine their locations (© 2019 IEEE. Reprinted, with permission, from [1].). At the end of this chapter, we also introduce
the progress in face hallucination, a fundamental problem in the face analysis field
that refers to generating a high-resolution facial image from a low-resolution input
image (© 2019 IEEE. Reprinted, with permission, from [2].).
Facial landmark localization aims to automatically predict the key point positions
in facial image regions. This task is an essential component in many face-related
applications, such as facial attribute analysis [3], facial verification [4, 5], and facial
recognition [6–8]. Although tremendous effort has been devoted to this topic, the
performance of facial landmark localization is still far from perfect, particularly in
facial regions with severe occlusions or extreme head poses.
Most of the existing approaches to facial landmark localization have been devel-
oped for a controlled setting, e.g., the facial regions are detected in a preprocessing
step. This setting has drawbacks when working with images taken in the wild (e.g.,
cluttered surveillance scenes), where automated face detection is not always reliable.
The objective of this work is to propose an effective and efficient facial landmark
localization method that is capable of handling images taken in unconstrained set-
tings that contain multiple faces, extreme head poses, and occlusions (see Fig. 3.1).
Specifically, we focus on the following issues when developing our algorithm.
• Faces may have great variations in appearance and structure in unconstrained set-
tings due to diverse viewing conditions, rich facial expressions, large pose changes,
Fig. 3.1 Facial landmark localization in unconstrained settings. First row: Two cluttered images
with an unknown number of faces; second row: Dense response maps generated by our method
facial accessories (e.g., glasses and hats), and aging. Therefore, traditional global
models may not work well because the usual assumptions (e.g., certain spatial
layouts) may not hold in such environments.
• Boosted-cascade-based fast face detectors, which evolved from the seminal work
of Viola and Jones [9], can work well only for near-frontal faces under normal
conditions. Although accurate deformable part-based models [10] can perform
much better on challenging datasets, these models are slow due to their high
complexity. Detection in an image takes a few seconds, which makes such detectors
impractical for our task.
In this section, we formulate facial landmark localization as a pixel-labeling prob-
lem and devise a fully convolutional neural network (FCN) to overcome the afore-
mentioned issues. The proposed approach produces facial landmark response maps
directly from raw images without relying on any preprocessing or feature engineer-
ing. Two typical landmark response maps generated with our method are shown in
Fig. 3.1.
Considering both computational efficiency and localization accuracy, we pose
facial landmark localization as a cascaded filtering process. In particular, the locations
of facial landmarks are first roughly detected in a global context and then refined
by observing local regions. To this end, we introduce a novel FCN architecture that
naturally follows this coarse-to-fine pipeline. Specifically, our architecture contains
one backbone network and several branches, with each branch corresponding to
one landmark type. For computational efficiency, the backbone network is designed
to be an FCN with lightweight filters, which takes a low-resolution image as its
input and rapidly generates an initial multichannel heat map, with each channel
predicting the location of a specific landmark. We can obtain landmark proposals
from each channel of the initial heat map. We can then crop a region centered at
every landmark proposal from both the original input image and the corresponding
channel of the response map. These cropped regions are stacked and fed to a branch
network for fine and accurate localization. Because fully connected layers are not
used in either network, we call our architecture a cascaded backbone-branches fully
convolutional network (BB-FCN). Due to the tailored architecture of the backbone
network, which can reject most background regions and retain high-quality landmark
proposals, the BB-FCN is also capable of accurately localizing the landmarks of faces
on various scales by rapidly scanning every level of the constructed image pyramid.
Furthermore, we have discovered that our landmark localization results can help
generate fewer and higher quality face proposals, thus enhancing the accuracy and
efficiency of face detection.
Given an unconstrained image I with an unknown number of faces, our facial land-
mark localization method aims to locate all facial landmarks in the image. We use
L_i^k = (x_i^k, y_i^k) to denote the location of the ith landmark of type k in image I, where x_i^k and y_i^k represent the coordinates of this landmark. Then, our task is to obtain the complete set of landmarks in I,

    L = { L_i^k },    (3.1)

where k = 1, 2, ..., K. When describing our method and analyzing the proposed
network, we set K = 5 as an example, but our method is also applicable to any other
values of K . Here, the five landmark types are the left eye (LE), right eye (RE), nose
(N), left mouth corner (LM), and right mouth corner (RM).
In contrast to existing approaches that predict landmark locations by coordinate
regression, we exploit fully convolutional neural networks (FCNs) to directly produce
response maps that indicate the probability of landmark existence at every image
location. In our method, the predicted value at each location of the response map
can be viewed as a series of filtering operations applied to a specific region of the
input image. The specific region is called the receptive field. An ideal series of filters
should have the following property: a receptive field with a landmark of a specific
type located at its center should return a strong response value, while receptive fields
without that type of landmark in the center should yield weak responses. Let F_{W_k}(P) denote the result of applying a series of filtering functions with parameter setting W_k for type-k landmarks to receptive field P; it is defined as follows:

    F_{W_k}(P) = { 1, if P has a type-k landmark in the center;
                   0, otherwise.                                  (3.2)

The response map H^k for type-k landmarks is then obtained by evaluating these filters at every location:

    H^k(x, y) = F_{W_k}(I(P(x, y))),    (3.3)

where I(P(x, y)) denotes the image patch corresponding to the receptive field of location (x, y) in the output response map. If the response value is larger than a threshold θ, a landmark of type k is detected at the center of the patch in image I.
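The thresholding rule above can be sketched as a simple pass over a response map. This is illustrative only: `detect_landmarks` and the toy map are our own names, and a real pipeline would also apply non-maximum suppression to nearby peaks.

```python
def detect_landmarks(response_map, theta=0.5):
    """Return the (x, y) positions whose response exceeds the threshold theta:
    each map value scores the receptive field centred at that location, and
    values above theta are declared type-k landmark detections."""
    detections = []
    for y, row in enumerate(response_map):
        for x, v in enumerate(row):
            if v > theta:
                detections.append((x, y))
    return detections

heat = [[0.1, 0.2, 0.1],
        [0.2, 0.9, 0.3],
        [0.1, 0.2, 0.1]]
print(detect_landmarks(heat))  # [(1, 1)]
```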
According to Eq. (3.3), there is a trade-off between localization accuracy and
computational cost. To achieve high accuracy, we need to compute response values
for significantly overlapping receptive fields. However, to accelerate the detection
process, we should generate a coarser response map on receptive fields with less
overlap or from a lower resolution image. This motivates us to develop a cascaded
coarse-to-fine process to localize landmarks progressively, in a spirit similar to that
of the hierarchical deep networks in [11], for image classification. More specifically,
our network consists of two components. The first component generates a coarse
response map from a relatively low-resolution input, identifying rough landmark
locations. Then, the second component takes local patches centered at every estimated
landmark location and applies another filtering process to the local patches to obtain
a fine response map for accurate landmark localization.
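A minimal sketch of this coarse-to-fine process follows; `topk_peaks`, `crop`, and `refine` are hypothetical stand-ins for the backbone heat map peaks, the guided crop, and the branch network, respectively.

```python
def topk_peaks(heat, k=1):
    """Coarse stage: take the k strongest responses as landmark proposals."""
    coords = [(v, x, y) for y, row in enumerate(heat) for x, v in enumerate(row)]
    coords.sort(reverse=True)
    return [(x, y) for _, x, y in coords[:k]]

def crop(image, cx, cy, size=1):
    """Crop a (2*size+1)-square patch centred on a proposal, clamped to the image."""
    h, w = len(image), len(image[0])
    return [[image[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]
             for x in range(cx - size, cx + size + 1)]
            for y in range(cy - size, cy + size + 1)]

def refine(patch):
    """Fine stage stand-in: return the offset of the strongest pixel in the
    patch. In BB-FCN this is a branch FCN producing a fine response map."""
    best = max((v, x, y) for y, row in enumerate(patch) for x, v in enumerate(row))
    return best[1] - len(patch[0]) // 2, best[2] - len(patch) // 2

def localize(image, coarse_heat):
    """Coarse-to-fine: a proposal from the backbone map, refined on a local patch."""
    (px, py), = topk_peaks(coarse_heat, k=1)
    dx, dy = refine(crop(image, px, py))
    return px + dx, py + dy

image = [[0, 0, 0, 0], [0, 0, 9, 0], [0, 0, 0, 0]]        # true landmark at (2, 1)
coarse_heat = [[0, 0, 0, 0], [0, 5, 0, 0], [0, 0, 0, 0]]  # coarse peak at (1, 1)
print(localize(image, coarse_heat))  # (2, 1)
```

The coarse stage only needs to land near the landmark; the local refinement corrects the residual offset, which is why the backbone can run on a low-resolution input.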
In this section, this two-component architecture is implemented as a backbone-
branches fully convolutional neural network in which the backbone network gen-
erates coarse response maps for rough location inference, and the branch networks
produce fine response maps for accurate location refinement. Figure 3.2 shows the
architecture of our network.
Let a convolutional layer be denoted as C(n, h × w × ch) and a deconvolutional
layer be denoted as D(n, h × w × ch), where n represents the number of kernels and
h,w,ch represent the height, width, and number of channels of a kernel, respectively.
We also use M P to denote a max-pooling layer. In our network, the stride of all
convolutional layers is 1, and the stride of all deconvolutional layers is 2. The size
of the max-pooling operator is set to 2 × 2, and the stride is 2.
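Under the stated strides (stride-1 convolutions, 2 × 2 max pooling with stride 2, deconvolutions with stride 2), the spatial resolution through the network can be tracked as follows. This is a sketch that assumes stride-1 convolutions are zero-padded to preserve spatial size, consistent with the 32 → 16 → 8 → 16 progression of the backbone.

```python
def output_size(size, layers):
    """Track spatial resolution through a sequence of layers written in the
    C / MP / D notation used above."""
    for layer in layers:
        if layer == "C":      # C(n, h x w x ch): stride 1, size preserved
            pass
        elif layer == "MP":   # 2 x 2 max pooling, stride 2: size halved
            size //= 2
        elif layer == "D":    # deconvolution, stride 2: size doubled
            size *= 2
    return size

# Backbone-like progression: two conv/pool blocks, then one upsampling step.
print(output_size(32, ["C", "MP", "C", "MP", "D"]))  # 16
```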
Fig. 3.2 The main architecture of the proposed backbone-branches fully convolutional neural
network. This approach is capable of producing pixelwise facial landmark response maps in a
progressive manner. The backbone network first generates low-resolution response maps that iden-
tify approximate landmark locations via a fully convolutional network. The branch networks then
produce fine response maps over local regions for more accurate landmark localization. There are
K (e.g., K = 5) branches, each of which corresponds to one type of facial landmark and refines
the related response map. Only downsampling, upsampling, and prediction layers are shown, and
intermediate convolutional layers are omitted in the network branches
Let H^k(I; W_c) denote the predicted heat map of image I for the kth type of landmark. The value of H^k(I; W_c) at position (x, y) can be computed with Eq. (3.3). We train the backbone FCN using the following loss function:

    L_1(I; W_c) = Σ_{k=1}^{K} ||H^k(I; W_c) − H_c^k(I)||²,    (3.4)

where H_c^k(I) denotes the corresponding ground truth heat map. To make the branch network better suited for landmark position refinement, we
resize the original input image to 64 × 64, four times the size of the backbone input,
and at the same time zoom the heat map from the backbone network to 64 × 64.
The resolution of all the cropped patches is 24 × 24, and they are all centered at the
landmark position predicted by the backbone network. As shown in Fig. 3.2, each
branch is trained in the same way as the backbone network. We denote the parameters
of the branch component for type-k landmarks as W_f^k and use H(P; W_f^k) and H_0^k(P) to denote the heat map that it generates and the corresponding ground truth heat map of patch P, respectively. The loss function of this branch component is again defined as follows:

    L_2(P; W_f^k) = ||H(P; W_f^k) − H_0^k(P)||².    (3.5)
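The losses in Eqs. (3.4) and (3.5) are plain squared L2 distances between predicted and ground-truth heat maps. A minimal sketch, using toy 2 × 2 maps in a list-of-lists representation:

```python
def sq_l2(pred, target):
    """Squared L2 distance between two heat maps, the per-map term in
    Eqs. (3.4) and (3.5)."""
    return sum((p - t) ** 2 for prow, trow in zip(pred, target)
               for p, t in zip(prow, trow))

def backbone_loss(preds, targets):
    """Eq. (3.4): sum the per-channel squared L2 terms over the K landmark types."""
    return sum(sq_l2(p, t) for p, t in zip(preds, targets))

pred   = [[0.0, 1.0], [0.0, 0.0]]
target = [[0.0, 0.0], [0.0, 0.0]]
print(backbone_loss([pred], [target]))  # 1.0
```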
Fig. 3.3 a An isolated point cannot accurately reflect discrepancies among multiple annotations.
The three points near the right mouth corner were annotated by three different workers. b We label
a landmark as a small circular region rather than an isolated point in the ground truth heat map
of an input image according to the annotated facial landmark locations. The most
straightforward method assigns “1” to a single pixel corresponding to each landmark
location and “0” to the remaining pixels. However, we argue that this method is
suboptimal because an isolated point cannot reflect discrepancies among multiple
annotations. As shown in Fig. 3.3a, the right mouth corner has three slightly different
locations marked by three annotators. To account for such discrepancies, we label
each landmark as a small region rather than an isolated point. We initialize the heat map to zero everywhere, and then, for each landmark p, we set the circular region with center p and radius R in the ground truth heat map to 1. Different radii are
adopted for the backbone network and branch networks, denoted as Rc and R f ,
respectively. R f is set to be smaller than Rc because the backbone network estimates
coarse landmark positions, while the branch networks predict accurate landmark
locations.
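The ground-truth construction described above can be sketched directly; `ground_truth_heatmap` is our own helper name, and `radius` plays the role of R_c (backbone) or R_f (branches).

```python
def ground_truth_heatmap(h, w, landmarks, radius):
    """Build a ground-truth heat map: 1 inside a circle of the given radius
    around each landmark, 0 elsewhere. Using a small disk rather than a
    single pixel absorbs the discrepancies among multiple annotations."""
    heat = [[0.0] * w for _ in range(h)]
    for (lx, ly) in landmarks:
        for y in range(h):
            for x in range(w):
                if (x - lx) ** 2 + (y - ly) ** 2 <= radius ** 2:
                    heat[y][x] = 1.0
    return heat

hm = ground_truth_heatmap(5, 5, [(2, 2)], radius=1)
print(sum(map(sum, hm)))  # 5.0 -- the centre pixel plus its 4-neighbourhood
```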
3.3.1 Datasets
To train our proposed BB-FCN, we collect 7317 facial images (6317 for training, 1000
for validation) from the Internet and collect 7542 natural images (6542 for training,
1000 for validation) with no faces from Pascal-VOC2012 as negative samples. Each
face is annotated with 72 landmarks. We use two challenging public datasets for
evaluation: AFW [10] and AFLW [12]. There is no overlap among the training,
validation, and evaluation datasets.
AFW: This dataset contains 205 images (468 faces) collected in the wild. Invisible
landmarks are not annotated, and each face is annotated with at most 6 landmarks.
This dataset is intended for use in testing facial keypoint detection in unconstrained
settings, meaning faces may exhibit large variations in pose, expression, and illumi-
nation and may have severe occlusions.
AFLW: This dataset contains 21,080 faces with large pose variations. It is highly
suitable for evaluating the performance of face alignment across a large range of
poses. The selection of testing images from AFLW follows [13], which randomly
chooses 3000 faces, 39% of which are nonfrontal.
To evaluate the accuracy of facial landmark localization, we adopt the mean (position)
error as the metric. For a specific type of landmark, the mean error is calculated as
the mean distance between the detected landmarks of the given type in all testing
images and their corresponding ground truth positions, normalized with respect to the
interocular distance. The (position) error of a single landmark is defined as follows:
    err = sqrt((x − x')² + (y − y')²) / l × 100%,    (3.6)

where (x, y) and (x', y') are the ground truth and detected landmark locations, respectively, and the interocular distance l is the Euclidean distance between the center
points of the two eyes. In our experiments, we evaluate the mean error of every type
of facial landmark as well as the average mean error over all landmark types, i.e.,
LE (left eye), RE (right eye), N (nose), LM (left mouth corner), RM (right mouth
corner), and A (average mean error of the five facial landmarks).
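Eq. (3.6) and the per-type mean error can be sketched as follows (the helper names are our own):

```python
import math

def landmark_error(pred, gt, interocular):
    """Eq. (3.6): Euclidean distance between the predicted and ground-truth
    positions, normalised by the interocular distance, in percent."""
    return math.hypot(pred[0] - gt[0], pred[1] - gt[1]) / interocular * 100.0

def mean_error(preds, gts, interoculars):
    """Mean error for one landmark type over a set of test faces."""
    errs = [landmark_error(p, g, d) for p, g, d in zip(preds, gts, interoculars)]
    return sum(errs) / len(errs)

print(landmark_error((13, 14), (10, 10), interocular=50.0))  # 10.0 (a 3-4-5 triangle, l = 50)
```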
The BB-FCN is capable of dealing with facial images taken in unconstrained settings;
e.g., the location of facial regions and the number of faces are unknown. We eval-
uate the performance of the BB-FCN using recall–error curves. A predictive facial
landmark is considered correct if there exists a ground truth landmark of the same
type within the given position error. For a fixed number of predictive landmarks, the
recall rate (the fraction of ground truth annotations covered by predictive landmarks)
varies as the acceptable position error increases; thus, a recall–error curve can be
obtained.
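The recall–error curve can be traced by sweeping the acceptable error threshold. Below is a minimal sketch of one point on that curve under the matching rule just described; `recall_at_error` is our own helper name.

```python
import math

def recall_at_error(preds, gts, interocular, max_err_pct):
    """Fraction of ground-truth landmarks covered by at least one prediction
    within the acceptable normalised position error; sweeping max_err_pct
    over a range of values yields the recall-error curve."""
    def covered(g):
        return any(math.hypot(p[0] - g[0], p[1] - g[1]) / interocular * 100.0
                   <= max_err_pct for p in preds)
    return sum(1 for g in gts if covered(g)) / len(gts)

gts = [(10, 10), (40, 40)]
preds = [(11, 10), (90, 90)]
print(recall_at_error(preds, gts, interocular=20.0, max_err_pct=8.0))  # 0.5
```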
We evaluate the performance of the BB-FCN and the regression-based deep model on the AFW dataset in an unconstrained setting. For faces with one or both eyes invisible, the interocular distance is set to 41.9% of the length of the annotated bounding box. The BB-FCN significantly outperforms the regression network, and
the complete BB-FCN model performs much better than the backbone network alone.
With a prediction of 15 landmarks for each landmark type, the complete model recalls
45% more landmarks than the regression network when the acceptable position error
is set within 8% of the interocular distance. As the number of landmark predictions of
each type increases to 30, the recall of five landmarks within a position error of 25%
of the interocular distance is 94.1%, 95.7%, 91.5%, 95.8%, and 95.2%, respectively. Given more predicted landmarks, we can achieve even higher landmark recall. Figure 3.4 demonstrates
some landmark detection results on the AFW dataset in unconstrained settings.
We compare our method with other state-of-the-art methods, i.e., (1) robust cas-
caded pose regression (RCPR) [14]; (2) a tree structured part model (TSPM) [10];
(3) Luxand face SDK; (4) explicit shape regression (ESR) [15]; (5) a cascaded
Fig. 3.4 Qualitative facial landmark detection results in unconstrained settings. The BB-FCN is
capable of dealing with unconstrained facial images, even though the location of facial regions and
the number of faces in the image are unknown. Best viewed in color
Fig. 3.5 Qualitative facial landmark localization results by our method. The first row shows the
results on the AFW dataset, and the second row shows the results on the AFLW dataset. Our method
is robust in conditions of occlusion, exaggerated expressions, and extreme illumination
deformable shape model (CDM) [16]; (6) the supervised descent method (SDM)
[17]; (7) a task-constrained deep convolutional network (TCDCN) [13]; (8) multi-
task cascaded convolutional networks (MTCNN) [18]; and (9) recurrent attentive-
refinement networks (RAR) [19]. The results of some competing methods are quoted
from [13].
On the AFW dataset, our average mean error over five landmark types is 6.18%,
which improves over the performance of the state-of-the-art TCDCN by 24.6%. On
the AFLW dataset, the BB-FCN model achieves a 6.28% average mean error, a 21.5%
improvement over TCDCN. The qualitative results in Fig. 3.5 show that our method is
robust in conditions of occlusion, exaggerated expressions, and extreme illumination.
Fig. 3.6 Sequentially discovering and enhancing facial parts in our attention-FH framework. At
each time step, our framework specifies an attended region based on past hallucination results and
enhances it by considering the global perspective of the whole face. The red solid bounding boxes
indicate the latest perceived patch in each step, and the blue dashed bounding boxes indicate all
the previously enhanced regions. We adopt a global reward at the end of the sequence to drive the
framework learning under the reinforcement learning paradigm
3.4 Attention-Aware Face Hallucination

Face hallucination can facilitate several face-related tasks, such as face attribute
recognition [20], face alignment [21], and face recognition [22], in complex real-world
scenarios in which facial images are often of very low quality.
The existing face hallucination methods usually focus on how to learn a dis-
criminative patch-to-patch mapping from LR images to HR images. Particularly,
substantial recent progress has been made by employing advanced convolutional
neural networks (CNNs) [23] and multiple cascaded CNNs [24]. The face structure
priors and spatial configurations [25, 26] are often treated as external information
for enhancing faces and facial parts. However, the contextual dependencies among
the facial parts are usually ignored during hallucination processing. According to
studies of the human perception process [27], humans start by perceiving whole
images and successively explore a sequence of regions with the attention shifting
mechanism rather than separately processing the local regions. This finding inspires
us to explore a new pipeline of face hallucination by sequentially searching for the
attentional local regions and considering their contextual dependency from a global
perspective.
Inspired by the recent successes of attention and recurrent models in a variety
of computer vision tasks [28–30], we propose an attention-aware face hallucination
(attention-FH) framework that recurrently discovers facial parts and enhances them
by fully exploiting the global interdependency of the image, as shown in Fig. 3.6.
In particular, accounting for the diverse characteristics of facial images in terms
of blurriness, pose, illumination, and facial appearance, we search for an optimal
accommodated enhancement route for each face hallucination. We resort to deep
reinforcement learning (RL) [31] to drive the model learning because this technique
has been demonstrated to be effective in globally optimizing sequential models
without requiring supervision at every step.
Given a facial image Ilr with low resolution, our attention-FH framework targets the
corresponding high-resolution facial image Ihr by learning a projection function F:
lt = fπ(st−1; θπ),
Î^{lt}_{t−1} = g(lt, It−1),        (3.8)
where fπ represents the recurrent policy network and θπ denotes its parameters. st−1
is the encoded input state of the recurrent policy network, which is constructed from
the input image It−1 and the encoded history action ht−1. g denotes a cropping
operation that crops a fixed-size patch from It−1 at location lt as the selected facial
part. The patch size is set to 60 × 45 for all facial images.
We then enhance each local facial part Î^{lt}_{t−1} using our local enhancement
network fe. The resulting enhanced local patch Î^{lt}_t is computed as

Î^{lt}_t = fe(Î^{lt}_{t−1}, It−1; θe),        (3.9)
where θe denotes the parameters of the local enhancement network. The output image
It at the t-th step is therefore obtained by replacing the local patch of the input image
It−1 at location lt with the enhanced patch Î^{lt}_t. Our whole sequential attention-FH
procedure can be written as

I0 = Ilr,
It = f(It−1; θ),  1 ≤ t ≤ T,        (3.10)
Ihr = IT,

where T is the maximal number of local patch mining steps, θ = [θπ; θe], and f =
[fπ; fe]. We set T = 25 empirically throughout this chapter.
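Leaving the two networks as stubs, the sequential procedure of Eqs. (3.8)–(3.10) can be sketched as a crop–enhance–paste loop. This is only an illustration of the control flow: the real fπ and fe are deep networks, and the stub functions below are placeholders of our own.

```python
import numpy as np

PATCH_H, PATCH_W = 60, 45   # fixed patch size from the text
rng = np.random.default_rng(0)

def f_pi(state):
    """Stub policy: pick a random top-left corner. The real policy
    network maps the state to a W x H probability map over locations."""
    img, history = state
    H, W = img.shape
    return (int(rng.integers(0, H - PATCH_H + 1)),
            int(rng.integers(0, W - PATCH_W + 1)))

def f_e(patch, img):
    """Stub enhancer: the real network enhances the patch conditioned
    on the whole current image, as in Eq. (3.9)."""
    return patch + 0.1   # placeholder "enhancement"

def attention_fh(I_lr, T=25):
    I = I_lr.copy()          # I_0 = I_lr
    history = None           # encoded action history (LSTM state in the real model)
    for _ in range(T):
        y, x = f_pi((I, history))                        # l_t = f_pi(s_{t-1})
        patch = I[y:y + PATCH_H, x:x + PATCH_W]          # crop: g(l_t, I_{t-1})
        I[y:y + PATCH_H, x:x + PATCH_W] = f_e(patch, I)  # paste the enhanced patch
    return I                 # I_hr = I_T
```

The loop makes explicit that each step conditions on the image as updated by all previous enhancements, which is how the contextual dependency among facial parts enters the model.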
The recurrent policy network performs sequential local patch mining, which can be
treated as a decision-making process at discrete time intervals. At each time step,
the agent acts to determine an optimal image patch to be enhanced by conditioning
on the current state that it has reached. Given the selected location, the extracted
local patch is enhanced through the proposed local enhancement network. During
each time step, the state is updated by rendering the hallucinated facial image with
the enhanced facial part. The policy network recurrently selects and enhances local
patches until the maximum time step is achieved. At the end of this sequence, a
delayed global reward, which is measured by the mean squared error between the
final face hallucination result and the ground truth high-resolution image, is employed
to guide the policy learning of the agent. The agent can thus iterate to explore an
optimal search route for each individual facial image to maximize the global holistic
reward.
State: The state st at the tth step should be able to provide enough information for
the agent to make a decision without looking back more than one step. It is, therefore,
composed of two parts: (1) the enhanced hallucinated facial image It from previous
steps, which enables the agent to sense rich contextual information for processing a
new patch, e.g., the part that is still blurred and requires enhancement, and (2) the
latent variable h t , which is obtained by forwarding the encoded history action vector
h t−1 to the LSTM layer and is used to incorporate all previous actions. Therefore,
the goal of the agent is to determine the location of the next attended local patch by
sequentially observing state st = {It , h t } to generate a high-resolution image Ihr .
Fig. 3.7 Network architecture of our recurrent policy network and local enhancement network.
At each time step, the recurrent policy network takes a current hallucination result It−1 and action
history vector encoded by LSTM (512 hidden states) as the input and then outputs the action prob-
abilities for all W × H locations, where W and H are the width and height of the input image,
respectively. The policy network first encodes the It−1 with one fully connected layer (256 neu-
rons) and then fuses the encoded image and the action vector with an LSTM layer. Finally, a fully
connected linear layer is appended to generate the W × H -way probabilities. Based on the proba-
bility map, we extract the local patch and then pass the patch and It−1 into the local enhancement
network to generate the enhanced patch. The local enhancement network is constructed by two fully
connected layers (each with 256 neurons) encoding It−1 and 8 cascaded convolutional layers for
image patch enhancement. Thus, a new face hallucination result can be generated by replacing the
local patch with an enhanced patch
Action: Given a facial image I of size W × H, the agent selects one action from
all possible locations lt ∈ {(x, y) | 1 ≤ x ≤ W, 1 ≤ y ≤ H}. As shown in Fig. 3.7, at
each time step t, the policy network f π first encodes the current hallucinated facial
image It−1 with a fully connected layer. Then, the LSTM unit in the policy network
fuses the encoded vector with the history action vector h t−1 . Ultimately, a final linear
layer is appended to produce a W × H -way vector, which indicates the probabilities
of all available actions P(lt = (x, y)|st−1 ), with each entry (x, y) indicating the
probability of the next attended patch being located at position (x, y). The agent then
takes action lt by stochastically drawing an entry following the action probability
distribution. During testing, we select the location lt with the highest probability.
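The action-selection rule can be sketched as follows (a simplified NumPy illustration; in the real model the scores come from the policy network's final linear layer):

```python
import numpy as np

def select_action(logits, train=True, rng=None):
    """Pick a patch location from a W x H map of action scores.

    logits: (H, W) unnormalized outputs of the policy's final linear
    layer. During training the location is sampled stochastically
    (needed for REINFORCE exploration); during testing the location
    with the highest probability is taken, as described in the text.
    """
    p = np.exp(logits - logits.max())
    p /= p.sum()                           # softmax over all H*W locations
    flat = p.ravel()
    if train:
        rng = rng or np.random.default_rng()
        idx = int(rng.choice(flat.size, p=flat))
    else:
        idx = int(flat.argmax())
    y, x = divmod(idx, logits.shape[1])    # back to 2-D coordinates
    return (x + 1, y + 1), p               # 1-based (x, y), as in the text
```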
Reward: The reward is applied to guide the agent to learn the sequential poli-
cies to obtain the entire action sequence. Because our model targets generating a
hallucinated facial image, we define the reward according to the mean squared error
(MSE) after enhancing T attended local patches at the selected locations with the
local enhancement network. Given the fixed local enhancement network f e , we first
compute the final face hallucination result IT by sequentially enhancing a list of
local patches mined by l = {l1, l2, ..., lT}. The MSE loss is thus obtained by computing
Lθπ = E_{p(l;π)}[||Ihr − IT||²], where p(l; π) is the probability distribution produced
by the policy network fπ. The reward rt at the t-th step can be set as

rt = 0 for t < T, and rt = −Lθπ for t = T.        (3.11)

When the discount factor is set to 1, the total discounted reward is R = −Lθπ.
Our attention-FH framework jointly trains the parameters θπ of the recurrent policy
network f π and parameters θe of the local enhancement network f e . We introduce a
reinforcement learning scheme to perform joint optimization.
First, we optimize the recurrent policy network with the REINFORCE algorithm
[35] guided by the reward given at the end of sequential enhancement. The local
enhancement network is optimized with mean squared error between the enhanced
patch and the corresponding patch from the ground truth high-resolution image.
This supervised loss is calculated at each time step and can be minimized based on
backpropagation.
Because we jointly train the recurrent policy network and local enhancement
network, the change of parameters in the local enhancement network will affect the
final face hallucination result, which in turn will cause a nonstationary objective for
the recurrent policy network. We further employ the variance reduction strategy, as
mentioned in [36], to reduce variance due to the moving rewards during the training
procedure.
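The policy update can be illustrated with a toy REINFORCE step. This is only a sketch of the mechanics on a stand-alone softmax policy, not the chapter's actual network: the delayed reward is compared against a moving-average baseline (the variance reduction strategy of [36]), and the resulting advantage scales the score-function gradient of each sampled action.

```python
import numpy as np

def reinforce_update(theta, actions, reward, baseline, lr=0.1):
    """One REINFORCE episode update for a toy softmax policy.

    theta:    (K,) logits of a softmax policy over K actions
    actions:  action indices sampled during the episode
    reward:   delayed terminal reward, R = -L in the text
    baseline: moving average of past rewards (variance reduction)
    """
    advantage = reward - baseline          # centered reward
    theta = np.array(theta, dtype=float)
    for a in actions:
        p = np.exp(theta - theta.max())
        p /= p.sum()
        grad_logp = -p
        grad_logp[a] += 1.0                # gradient of log pi(a | theta)
        theta += lr * advantage * grad_logp
    return theta

def update_baseline(baseline, reward, momentum=0.9):
    """Exponential moving average of rewards, used as the baseline."""
    return momentum * baseline + (1.0 - momentum) * reward
```

Actions that led to a better-than-baseline reward have their probabilities increased; actions that led to a worse reward are suppressed.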
3.4.5 Experiments
Table 3.1 Comparison between our method and others in terms of the PSNR, SSIM, and FSIM
evaluation metrics

Method      | LFW-funneled 8×        | BioID 8×
            | PSNR   SSIM    FSIM    | PSNR   SSIM    FSIM
Bicubic     | 21.92  0.6712  0.7824  | 20.68  0.6434  0.7539
SFH [40]    | 22.12  0.6732  0.7832  | 20.31  0.6234  0.7238
BCCNN [23]  | 22.62  0.6801  0.7903  | 21.40  0.6504  0.7621
MZQ [41]    | 22.12  0.6771  0.7802  | 21.11  0.6481  0.7594
SRCNN [26]  | 23.92  0.6927  0.8314  | 22.34  0.6980  0.8274
VDSR [32]   | 24.12  0.7031  0.8391  | 24.31  0.7321  0.8465
GLN [33]    | 24.51  0.7109  0.8405  | 24.76  0.7421  0.8525
Our method  | 26.17  0.7604  0.8630  | 26.56  0.7864  0.8748
References
1. L. Liu, G. Li, Y. Xie, Y. Yu, Q. Wang, L. Lin, Facial landmark machines: a backbone-branches
architecture with progressive representation learning. IEEE Trans. Multimedia. https://doi.org/
10.1109/TMM.2019.2902096
2. Q. Cao, L. Lin, Y. Shi, X. Liang, G. Li, Attention-aware face hallucination via deep reinforce-
ment learning, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Honolulu, HI, pp. 1656–1664 (2017). https://doi.org/10.1109/CVPR.2017.180
3. P. Luo, X. Wang, X. Tang, A deep sum-product architecture for robust facial attributes analysis,
in ICCV, pp. 2864–2871 (2013)
4. C. Lu, X. Tang, Surpassing human-level face verification performance on lfw with gaussianface,
in AAAI (2015)
5. L. Liu, C. Xiong, H. Zhang, Z. Niu, M. Wang, S. Yan, Deep aging face verification with large
gaps. TMM 18(1), 64–75 (2016)
6. Z. Zhu, P. Luo, X. Wang, X. Tang, Deep learning identity-preserving face space, in ICCV, pp.
113–120 (2013)
7. C. Ding, D. Tao, Robust face recognition via multimodal deep face representation. TMM
17(11), 2049–2058 (2015)
8. Y. Li, L. Liu, L. Lin, Q. Wang, Face recognition by coarse-to-fine landmark regression with
application to atm surveillance, in CCCV (Springer, 2017), pp. 62–73
9. P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in CVPR,
vol. 1. IEEE, pp. I–511 (2001)
10. X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild,
in CVPR (IEEE, 2012), pp. 2879–2886
11. Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste, W. Di, Y. Yu, Hd-cnn: hierarchical
deep convolutional neural networks for large scale visual recognition, in ICCV, pp. 2740–2748
(2015)
12. M. Köstinger, P. Wohlhart, P.M. Roth, H. Bischof, Annotated facial landmarks in the wild: a
large-scale, real-world database for facial landmark localization, in ICCV Workshops (IEEE,
2011), pp. 2144–2151
13. Z. Zhang, P. Luo, C.C. Loy, X. Tang, Facial landmark detection by deep multi-task learning,
in ECCV (Springer, 2014), pp. 94–108
14. X. Burgos-Artizzu, P. Perona, P. Dollár, Robust face landmark estimation under occlusion, in
ICCV, pp. 1513–1520 (2013)
15. X. Cao, Y. Wei, F. Wen, J. Sun, Face alignment by explicit shape regression. IJCV 107(2),
177–190 (2014)
16. X. Yu, J. Huang, S. Zhang, W. Yan, D. Metaxas, Pose-free facial landmark fitting via optimized
part mixtures and cascaded deformable shape model, in ICCV, pp. 1944–1951 (2013)
17. X. Xiong, F. Torre, Supervised descent method and its applications to face alignment, in CVPR,
pp. 532–539 (2013)
18. K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded
convolutional networks. IEEE Signal Process. Lett. 23(10), 1499–1503 (2016)
19. S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, A. Kassim, Robust facial landmark detection via
recurrent attentive-refinement networks, in ECCV (Springer, 2016), pp. 57–72
20. Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in ICCV, pp.
3730–3738 (2015)
21. Z. Zhang, P. Luo, C.C. Loy, X. Tang, Learning deep representation for face alignment with
auxiliary attributes. IEEE Trans. Pattern Anal. Mach. Intell. 38(5), 918–930 (2016)
22. E. Zhou, Z. Cao, Q. Yin, Naive-deep face recognition: touching the limit of lfw benchmark or
not? arXiv preprint arXiv:1501.04690 (2015)
23. E. Zhou, H. Fan, Z. Cao, Y. Jiang, Q. Yin, Learning face hallucination in the wild, in AAAI,
pp. 3871–3877 (2015)
24. S. Zhu, S. Liu, C.C. Loy, X. Tang, Deep cascaded bi-network for face hallucination. arXiv
preprint arXiv:1607.05046 (2016)
25. C. Liu, H.-Y. Shum, W.T. Freeman, Face hallucination: theory and practice. Int. J. Comput.
Vis. 75(1), 115–134 (2007)
26. C. Dong, C.C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-
resolution, in ECCV, pp. 184–199 (2014)
27. J. Najemnik, W.S. Geisler, Optimal eye movement strategies in visual search. Nature 434(7031),
387–391 (2005)
28. Y. Sun, D. Liang, X. Wang, X. Tang, Deepid3: face recognition with very deep neural networks.
arXiv preprint arXiv:1502.00873 (2015)
29. J.C. Caicedo, S. Lazebnik, Active object localization with deep reinforcement learning, in
ICCV, pp. 2488–2496 (2015)
30. K. Gregor, I. Danihelka, A. Graves, D.J. Rezende, D. Wierstra, DRAW: a recurrent neural
network for image generation, in ICLR, pp. 1462–1471 (2015)
31. D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre et al., Mastering the game of Go with
deep neural networks and tree search. Nature 529, 484–489 (2016)
32. J. Kim, J.K. Lee, K.M. Lee, Accurate image super-resolution using very deep convolutional
networks, in CVPR (2016)
33. O. Tuzel, Y. Taguchi, J.R. Hershey, Global-local face upsampling network. arXiv preprint
arXiv:1603.07235 (2016)
34. S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, L. Zhang, Convolutional sparse coding for image
super-resolution, in ICCV, pp. 1823–1831 (2015)
35. R.J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement
learning. Mach. Learn. 8(3), 229–256 (1992)
36. V. Mnih, N. Heess, A. Graves, K. Kavukcuoglu, Recurrent models of visual attention, in NIPS,
pp. 2204–2212 (2014)
37. O. Jesorsky, K.J. Kirchberg, R. Frischholz, Robust face detection using the hausdorff distance,
in AVBPA, pp. 90–95 (2001)
38. G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: A database for
studying face recognition in unconstrained environments. Technical Report 07-49, University
of Massachusetts, Amherst, October 2007
39. G.B. Huang, V. Jain, E. Learned-Miller, Unsupervised joint alignment of complex images, in
ICCV (2007)
40. C.-Y. Yang, S. Liu, M.-H. Yang, Structured face hallucination, in CVPR, pp. 1099–1106 (2013)
41. X. Ma, J. Zhang, C. Qi, Hallucinating face by position-patch. Pattern Recogn. 43(6), 2224–2236
(2010)
42. D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, in ICLR (2015)
43. T. Chen, L. Lin, L. Liu, X. Luo, X. Li, Disc: deep image saliency computing via progressive
representation learning. TNNLS 27(6), 1135–1149 (2016)
44. L. Liu, H. Wang, G. Li, W. Ouyang, L. Lin, Crowd counting using deep recurrent spatial-aware
network, in IJCAI (2018)
45. L. Liu, R. Zhang, J. Peng, G. Li, B. Du, L. Lin, Attentive crowd flow machines, in ACM MM
(ACM, 2018), pp. 1553–1561
46. Y. Sun, X. Wang, X. Tang, Deep convolutional network cascade for facial point detection, in
CVPR, pp. 3476–3483 (2013)
47. R. Weng, J. Lu, Y.-P. Tan, J. Zhou, Learning cascaded deep auto-encoder networks for face
alignment. TMM 18(10), 2066–2078 (2016)
Chapter 4
Pedestrian Detection with RPN
and Boosted Forest
Abstract Although recent deep learning object detectors have shown excellent per-
formance for general object detection, they have limited success in detecting pedestri-
ans; therefore, previous leading pedestrian detectors were generally hybrid methods
combining handcrafted and deep convolutional features. In this chapter, we propose a
very simple but effective baseline for pedestrian detection using an RPN followed by
boosted forest on shared high-resolution convolutional feature maps. We comprehen-
sively evaluate this method on several benchmarks and find that it shows competitive
accuracy and good speed.
4.1 Introduction
Fig. 4.1 Two challenges for Fast/Faster R-CNN in pedestrian detection. a Small objects for which
ROI pooling on low-resolution feature maps may fail. b Hard negative examples that receive no
careful attention in Fast/Faster R-CNN
4.2 Approach
Our approach consists of two components (illustrated in Fig. 4.2): an RPN that gen-
erates candidate boxes as well as convolutional feature maps and a boosted forest
that classifies these proposals using these convolutional features.
The RPN in Faster R-CNN [1] was developed as a class-agnostic detector (proposer)
in the scenario of multicategory object detection. For single-category detection, RPN
is naturally a detector for the only category concerned. We specifically tailor the RPN
for pedestrian detection, as introduced in the following sections.
We adopt anchors (reference boxes) with a single aspect ratio of 0.41 (width to
height). This is the average aspect ratio of pedestrians, as indicated in [4]. This ap-
proach differs from that of the original RPN, which has anchors with multiple aspect
ratios. Anchors with inappropriate aspect ratios are associated with few examples
and thus are noisy and harmful for detection accuracy. In addition, we use anchors of
9 different scales, starting from a 40-pixel height with a scaling stride of 1.3×. This
spans a wider range of scales than the original RPN. The usage of multiscale anchors
enables us to waive the requirement of using feature pyramids to detect multiscale
objects.
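The anchor design above fully determines the anchor set. A small sketch (assuming anchor sizes are expressed as width–height pairs in pixels):

```python
def pedestrian_anchors(num_scales=9, base_height=40.0,
                       scale_stride=1.3, aspect_ratio=0.41):
    """Enumerate anchor (width, height) pairs: a single width/height
    ratio of 0.41 and 9 scales starting from a 40-pixel height with a
    scaling stride of 1.3x, as specified in the text."""
    anchors, h = [], base_height
    for _ in range(num_scales):
        anchors.append((aspect_ratio * h, h))
        h *= scale_stride
    return anchors
```

The tallest anchor reaches a height of 40 × 1.3⁸ ≈ 326 pixels, which is why no feature pyramid is needed to cover multiscale pedestrians.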
We adopt the VGG-16 net [15] pretrained on the ImageNet dataset [16] as the
backbone network. The RPN is built on top of the Conv5_3 layer, which is followed by
an intermediate 3 × 3 convolutional layer and two sibling 1 × 1 convolutional layers
for classification and bounding box regression. In this way, RPN regresses boxes
with a stride of 16 pixels (Conv5_3). The classification layer provides confidence
scores for the predicted boxes, which can be used as the initial scores of the boosted
forest cascade that follows.
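Combined with the stride-16 feature map, this fixes the number of RPN outputs per image. A quick arithmetic sketch (the 640 × 480 input size is an assumed example, not from the chapter):

```python
def rpn_output_size(img_w, img_h, stride=16, scales=9, ratios=1):
    """Number of spatial positions and anchors the RPN evaluates.

    With the stride-16 Conv5_3 feature map, each of the (W/16) x (H/16)
    positions carries scales * ratios anchors (9 scales, one aspect
    ratio here, per the anchor design above).
    """
    fw, fh = img_w // stride, img_h // stride
    return fw * fh, fw * fh * scales * ratios
```

For an assumed 640 × 480 input this gives a 40 × 30 map and 10,800 anchors.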
Fig. 4.2 Our pipeline. An RPN is used to compute candidate bounding boxes, scores, and convolu-
tional feature maps. The candidate boxes are fed into cascaded boosted forest (BF) for classification,
using the features pooled from the convolutional feature maps computed by the RPN
With the proposals generated by the RPN, we adopt ROI pooling [2] to extract fixed-
length features from regions. These features are used to train BF, as described in the
next section. Unlike Faster R-CNN, which requires that these features be fed into the
original fully connected (fc) layers and thus limits their dimensions, the BF classifier
imposes no constraint on the dimensions of the features. For example, we can extract
features from ROIs on Conv3_3 (stride = 4 pixels) and Conv4_3 (stride = 8 pixels).
We pool the features into a fixed resolution of 7 × 7. These features from different
layers are simply concatenated without normalization owing to the flexibility of the
BF classifier; in contrast, feature normalization must be carefully addressed [17] for
deep classifiers when concatenating features.
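A minimal NumPy sketch of this pooling-and-concatenation step (simplified: a single image, max pooling into 7 × 7 bins with no interpolation, and the strides of 4 and 8 stated above for Conv3_3 and Conv4_3):

```python
import numpy as np

def roi_pool(fmap, box, stride, out=7):
    """Max-pool one ROI on a feature map into a fixed out x out grid.

    fmap: (C, H, W) conv features; box: (x1, y1, x2, y2) in image pixels;
    stride: feature stride of this layer (4 for Conv3_3, 8 for Conv4_3).
    """
    x1, y1, x2, y2 = [int(round(c / stride)) for c in box]
    roi = fmap[:, y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
    C, H, W = roi.shape
    ys = np.linspace(0, H, out + 1).astype(int)
    xs = np.linspace(0, W, out + 1).astype(int)
    pooled = np.zeros((C, out, out), fmap.dtype)
    for i in range(out):
        for j in range(out):
            cell = roi[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[:, i, j] = cell.max(axis=(1, 2))
    return pooled

def concat_features(conv3, conv4, box):
    """Features from different layers are pooled and then simply
    concatenated along the channel axis, with no normalization."""
    f3 = roi_pool(conv3, box, stride=4)
    f4 = roi_pool(conv4, box, stride=8)
    return np.concatenate([f3, f4], axis=0).ravel()
```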
Remarkably, as there is no constraint imposed on feature dimensions, we have the
flexibility to use features with increased resolution. In particular, given the fine-tuned
layers from the RPN (stride = 4 on Conv3, 8 on Conv4, and 16 on Conv5), we can
use the à trous trick [6] to compute higher resolution convolutional feature maps. For
example, we can set the stride of Pool3 at 1 and dilate all Conv4 filters by 2, which
reduces the stride of Conv4 from 8 to 4. In contrast to previous methods [6, 7] that
fine-tune the dilated filters, in our method, we use them only for feature extraction
and do not fine-tune a new RPN.
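The à trous (dilation) trick itself is simple to illustrate: inserting zeros between filter taps enlarges a filter's effective footprint, so it can be applied on the finer-stride (undecimated) feature map while preserving its receptive field. A sketch:

```python
import numpy as np

def dilate_filter(w, rate=2):
    """Insert (rate - 1) zeros between the taps of a square filter.

    A 3x3 kernel becomes an effective 5x5 kernel, so Conv4 filters
    applied after setting the Pool3 stride to 1 still see the same
    receptive field, while the feature stride drops from 8 to 4.
    """
    k = w.shape[0]
    out = np.zeros(((k - 1) * rate + 1,) * 2, dtype=w.dtype)
    out[::rate, ::rate] = w
    return out
```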
Although we adopt the same ROI resolution (7 × 7) as that of Faster R-CNN
[1], these ROIs are on higher resolution feature maps (e.g., Conv3_3, Conv4_3, or
Conv4_3 à trous) than Fast R-CNN (Conv5_3). If an ROI input resolution is smaller
than the output (i.e., <7 × 7), the pooling bins collapse, and the features become
“flat” and not discriminative. This problem is alleviated in our method, as it is not
constrained to use Conv5_3 features in the downstream classifier.
The RPN generates region proposals, confidence scores, and features, all of which are
used to train a cascaded boosted forest classifier. We adopt the RealBoost algorithm
[8] and mainly follow the hyperparameters in [18]. Formally, we bootstrap the train-
ing 6 times, and the forest in each stage has {64, 128, 256, 512, 1024, 1536} trees.
Initially, the training set consists of all positive examples (∼50k on the Caltech set)
and the same number of randomly sampled negative examples from the proposals.
After each stage, additional hard negative examples (whose number is 10% of the
positives, ∼5k on Caltech) are mined and added to the training set. Finally, a forest
of 2048 trees is trained after all bootstrapping stages. This final forest classifier is
used for inference.
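The bootstrapping schedule above can be summarized with a little arithmetic (a sketch of the bookkeeping only; the actual tree training is omitted):

```python
def bootstrap_schedule(num_pos=50_000, hard_neg_frac=0.10,
                       stage_trees=(64, 128, 256, 512, 1024, 1536)):
    """Track the training-set size across the six bootstrapping stages:
    start with all positives plus an equal number of random negatives,
    then add hard negatives (10% of the positives) after each stage."""
    num_neg = num_pos
    sizes = []
    for trees in stage_trees:
        sizes.append((trees, num_pos + num_neg))
        num_neg += int(hard_neg_frac * num_pos)   # mine ~5k hard negatives
    final_forest = 2048                           # trained after all stages
    return sizes, final_forest
```

With the Caltech numbers, the set grows from about 100k examples at the first stage to about 125k at the last, before the final 2048-tree forest is trained.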
We note that it is not necessary to treat the initial proposals equally because the
initial confidence scores of the proposals are computed by the RPN. In other words,
the RPN can be considered as the stage-0 classifier f0, and we set f0 = (1/2) log(s/(1 − s)),
where s is the confidence score of a proposal.
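The stage-0 score mapping has the RealBoost logit form f0 = ½ log(s/(1 − s)). A small sketch (the clamping epsilon is our own numerical safeguard, not from the chapter):

```python
import math

def stage0_score(s, eps=1e-6):
    """Map an RPN confidence s in (0, 1) to the boosting-style logit
    f0 = 0.5 * log(s / (1 - s)) used to initialize the cascade."""
    s = min(max(s, eps), 1.0 - eps)  # clamp for numerical safety
    return 0.5 * math.log(s / (1.0 - s))
```

A proposal with s = 0.5 starts the cascade at zero, while confident proposals start with a positive logit and unlikely ones with a negative logit.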
4.3 Experiments and Analysis

Caltech Figures 4.3 and 4.5 show the results on the Caltech dataset. When original
annotations are used (Fig. 4.3), our method has an MR of 9.6%, which is more than 2
points better than that of the closest competitor (11.7% of CompACT-Deep [18]).
When the corrected annotations are used (Fig. 4.5), our method has an MR−2 of 7.3%
and an MR−4 of 16.8%, both of which are 2 points better than those of the previous
best methods.
Fig. 4.4 Comparisons on the Caltech set using an IoU threshold of 0.7 to determine true positives
(legends indicate MR)
Fig. 4.5 Comparisons on the Caltech-New set (legends indicate MR−2 (MR−4 ))
In addition, except for CCF (MR 18.7%) [19], our method (MR 9.6%) is the
only method that uses no handcrafted features. Our results suggest that handcrafted
features are not essential for good accuracy on the Caltech dataset; rather, high-
resolution features and bootstrapping, both of which are missing in the original Fast
R-CNN detector, are the keys to good accuracy.
Figure 4.4 shows the results on Caltech, where an IoU threshold of 0.7 is
used to determine true positives (instead of 0.5 by default). With this more chal-
lenging metric, most methods exhibit a dramatic performance decrease; e.g., the
MR of CompACT-Deep [18]/DeepParts [20] increases from 11.7%/11.9% to
38.1%/40.7%. Our method has an MR of 23.5%, which is a relative improvement
of ∼40% over that of the closest competitors. This comparison demonstrates that
our method has substantially better localization accuracy than other methods. It also
indicates that there is much room to improve localization performance on this widely
evaluated dataset.
INRIA and ETH Figures 4.6 and 4.7 show the results on the INRIA and ETH
datasets. On the INRIA dataset, our method achieves an MR of 6.9%, which is
considerably better than that of the best available competitor (11.2%). On the ETH
set, our result (30.2%) is better than that of the previous leading method (TA-CNN
[21]) by 5 points.
References
1. R. Girshick, S. Ren, K. He, J. Sun, Faster r-cnn: Towards real-time object detection with region
proposal networks, in Neural Information Processing Systems (NIPS) (2015)
2. R. Girshick, Fast r-cnn, in IEEE International Conference on Computer Vision (ICCV) (2015)
3. J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, Selective search for object
recognition. IJCV 104(2), 154–171 (2013)
4. P. Dollár, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: an evaluation of the state of
the art. IEEE Trans. Pattern Anal. Mach. Intell. (2012)
5. K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for
visual recognition, in European Conference on Computer Vision (ECCV) (2014)
6. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Semantic image segmentation
with deep convolutional nets and fully connected CRFs. arXiv:1412.7062 (2014)
7. J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting.
Ann. Statist. (2000)
8. R. Appel, T. Fuchs, P. Dollár, P. Perona, Quickly boosting decision trees: pruning underachieving
features early, in International Conference on Machine Learning (ICML) (2013)
9. N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) (2005)
10. A. Ess, B. Leibe, L. Van Gool, Depth and appearance for mobile scene analysis, in IEEE
International Conference on Computer Vision (ICCV) (2007)
11. A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision
benchmark suite, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
12. P. Dollár, Z. Tu, P. Perona, S. Belongie, Integral channel features, in British Machine Vision
Conference (BMVC) (2009)
13. P. Dollár, R. Appel, S. Belongie, P. Perona, Fast feature pyramids for object detection. IEEE Trans. Pattern Anal. Mach. Intell. (2014)
14. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
15. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition
challenge. Int. J. Comput. Vis. (2015)
16. W. Liu, A. Rabinovich, A.C. Berg, ParseNet: looking wider to see better. arXiv:1506.04579 (2015)
17. M. Saberian, Z. Cai, N. Vasconcelos, Learning complexity-aware cascades for deep pedestrian
detection, in IEEE International Conference on Computer Vision (ICCV) (2015)
18. B. Yang, J. Yan, Z. Lei, S. Z. Li, Convolutional channel features, in ICCV, pp. 82–90 (2015)
19. X. Wang, Y. Tian, P. Luo, X. Tang, Deep learning strong parts for pedestrian detection, in IEEE
International Conference on Computer Vision (ICCV) (2015)
20. X. Wang, Y. Tian, P. Luo, X. Tang, Pedestrian detection aided by deep learning semantic tasks,
in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
21. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in
CVPR, pp. 3431–3440 (2015)
Part III
Parsing Person in Detail
Abstract Human parsing has recently attracted much research interest due to its
enormous application potential. In this chapter, we introduce a new benchmark,
“Look into Person (LIP),” that makes a significant advance in terms of scalability,
diversity, and difficulty, a contribution that we feel is crucial for future developments
in human-centric analysis. Furthermore, in contrast to the existing efforts to improve
feature discriminative capability, we solve human parsing by exploring a novel self-
supervised structure-sensitive learning approach that imposes human pose structures
on the parsing results without requiring extra supervision. Our self-supervised learn-
ing framework can be injected into any advanced neural network to help incorporate
rich high-level knowledge regarding human joints from a global perspective and
improve the parsing results( c
[2019] IEEE., Reprinted, with permission, from [1]).
5.1 Introduction
Human parsing aims to segment a human image into multiple parts with fine-grained
semantics and provide a more detailed understanding of image content. It can support
many higher-level computer vision applications [2], such as person reidentification [3]
and human behavior analysis [4, 5].
Recently, convolutional neural networks (CNNs) have achieved exciting success
in human parsing [6–8]. Nevertheless, as demonstrated in many other problems such
as object detection [9] and semantic segmentation [10], the performance of such
CNN-based approaches relies heavily on the availability of annotated images for
training. To train a human parsing network with potentially practical value in real-
world applications, it is highly desirable to have a large-scale dataset composed of
representative instances with varied clothing appearances, strong articulation, partial
(self-)occlusions, truncation at image borders, diverse viewpoints, and background
clutter. Although training sets exist for special scenarios such as fashion pictures [6,
8, 11, 12] and people in constrained situations (e.g., upright) [13], these datasets
are limited in their coverage and scalability, as shown in Fig. 5.1. The largest public
human parsing dataset [8] thus far contains only 17,000 fashion images, while others
include only thousands of images (Table 5.1).
Fig. 5.1 Annotation examples for our “Look into Person (LIP)” dataset and existing datasets. a
The images in the ATR dataset are of fixed size and contain only instances of persons standing up
in the outdoors. b The images in the PASCAL-Person-Part dataset also have lower scalability and
contain only 6 coarse labels. c The images in our LIP dataset have high appearance variability and
complexity
Table 5.1 Overview of the publicly available datasets for human parsing. For each dataset, we
report the number of annotated persons in the training, validation, and test sets as well as the
number of categories, including background
Dataset                    #Training  #Validation  #Test   Categories
Fashionista [14]           456        –            229     56
PASCAL-Person-Part [13]    1,716      –            1,817   7
ATR [8]                    16,000     700          1,000   18
LIP                        30,462     10,000       10,000  20
However, to the best of our knowledge, no attempt has been made to establish a
standard representative benchmark aiming to cover a wide range of challenges for the
human parsing task. The existing datasets do not provide an evaluation server with a
secret test set to avoid potential dataset overfitting, which hinders further development
in this area. Therefore, we propose a new benchmark, “Look into Person (LIP)”, and a
public server to automatically report evaluation results. Our benchmark significantly
advances the state of the art in terms of appearance variability and complexity, as
it includes 50,462 human images with pixel-wise annotations of 19 semantic parts
(Fig. 5.2).
5.2 Look into Person Benchmark
Fig. 5.2 An example shows that self-supervised structure-sensitive learning is helpful for human
parsing. a The original image. b Parsing results by attention-to-scale [15], with the left arm wrongly
labeled as the right arm. c Our parsing results successfully incorporate the structure information to
generate reasonable outputs
With 50,462 annotated images, LIP is an order of magnitude larger and more challenging
than previous similar attempts [8, 13, 14]. It provides elaborate pixel-wise
annotations with 19 semantic human part labels and one background label.
The images collected from real-world scenarios contain people appearing with chal-
lenging poses and viewpoints, heavy occlusions, various appearances, and a wide
range of resolutions. Furthermore, the backgrounds of the images in the LIP dataset
are more complex and diverse than those in previous datasets. Some examples are
shown in Fig. 5.1.
• Image Annotation The images in the LIP dataset are cropped person instances
from the Microsoft COCO [16] training and validation sets. We defined 19 human
part or clothing labels for annotation: hat, hair, sunglasses, upper clothing, dress,
coat, socks, pants, gloves, scarf, skirt, jumpsuit, face, right arm, left arm, right leg,
left leg, right shoe, and left shoe, as well as a background label. We implemented
an annotation tool and generated multiscale superpixels of the images based on [17]
to speed up the annotation.
• Dataset splits In total, there are 50,462 images in the LIP dataset, including 19,081
full-body images, 13,672 upper body images, 403 lower body images, 3,386 head-
missing images, 2,778 back-view images, and 21,028 images with occlusions. We
divide the images into separate training, validation, and test sets. Following random
selection, we arrive at a unique division consisting of 30,462 training and 10,000
validation images with publicly available annotations as well as 10,000 test images
with the annotations withheld for benchmarking purposes.
• Dataset statistics In this section, we analyze the images and categories in the LIP
dataset in detail. In general, the face, arms, and legs are the most identifiable
parts of a human body. However, human parsing aims to analyze every detailed
region of a person, including different body parts as well as different categories
of clothing. We, therefore, define 6 body parts and 13 clothing categories. Among
62 5 Self-supervised Structure-Sensitive Learning for Human Parsing
Fig. 5.3 The data distribution of the 19 semantic part labels in the LIP dataset
the 6 body parts, we divide arms and legs into left and right sides for more precise
analysis, which also increases the difficulty of the task. For clothing classes, we
include not only common clothing, such as upper clothing, pants, and shoes, but
also infrequent categories, such as skirts and jumpsuits. Furthermore, small-scale
accessories, such as sunglasses, gloves, and socks, are also taken into account. The
numbers of images for each semantic part label are presented in Fig. 5.3.
The images in the LIP dataset contain diverse human appearances, viewpoints,
and occlusions. Additionally, more than half of the images suffer occlusions of
different degrees. Occlusion is considered to occur if any of the 19 semantic parts
appear in the image but are occluded or invisible. In more challenging cases, the
images contain back-view instances, which give rise to greater ambiguity in the
left and right spatial layouts. The numbers of images of different appearances
(i.e., occlusion, full-body, upper body, head-missing, back-view, and lower body
images) are summarized in Fig. 5.4.
Fig. 5.4 The numbers of images that show diverse types of visibility in the LIP dataset, including
occlusion, full-body, upper body, lower body, head-missing, and back-view images
Fig. 5.5 Illustration of self-supervised structure-sensitive learning for human parsing. An input
image goes through parsing networks, including several convolutional layers, to generate the parsing
results. The generated joints and joints ground truth, represented as heatmaps, are obtained by
computing the center points of the corresponding regions in parsing maps, including head (H),
upper body (U), lower body (L), right arm (RA), left arm (LA), right leg (RL), left leg (LL), right
shoe (RS), and left shoe (LS). The structure-sensitive loss is generated by weighting segmentation
loss with joint structure loss. For clear observation, we combine nine heatmaps into one map
errors. The predicted joints do not have high enough quality to guide human parsing
compared with the joints extracted from parsing annotations. Moreover, the joints
in pose estimation are not aligned with parsing annotations. For example, the arms
are labeled as arms for parsing annotations only if they are not covered by clothing,
while the pose annotations are independent of clothing. To address these issues in this
work, we investigate how to leverage informative high-level structure cues to guide
pixel-wise prediction. We propose a novel self-supervised structure-sensitive learn-
ing for human parsing, which introduces a self-supervised structure-sensitive loss to
evaluate the quality of predicted parsing results from a joint structure perspective, as
illustrated in Fig. 5.5.
Specifically, in addition to using the traditional pixel-wise annotations for super-
vision, we generate the approximated human joints directly from the parsing anno-
tations, which can also guide human parsing training. To explicitly enforce semantic
consistency between the produced parsing results and human joint structures, we
treat the joint structure loss as a weight of segmentation loss, which becomes our
structure-sensitive loss.
Generally, for the human parsing task, no information beyond the pixel-wise
annotations is provided. Rather than relying on extra annotations, we must derive
structure-sensitive supervision from the parsing annotations themselves. As the human
parsing results are semantic parts with pixel-level labels, we try to explore the pose
information contained in human parsing results. We define 9 joints to construct a pose
structure, which are the centers of the regions of the head, upper body, lower body,
left arm, right arm, left leg, right leg, left shoe, and right shoe. The region of the head
is generated by merging the parsing labels of hat, hair, sunglasses, and face. Similarly,
upper clothing, coat, and scarf are merged into upper body and pants and skirt into
lower body. The remaining regions can also be obtained by the corresponding labels.
Some examples of human joints generated for different humans are shown in Fig. 5.6.
Following [23], for each parsing result and its corresponding ground truth, we compute
the center points of the regions and represent the joints as heatmaps for smoother training.
Then, we use the Euclidean metric to evaluate the quality of the generated joint
structures, which also reflect the structural consistency between the predicted parsing
results and the ground truth. Finally, the pixel-wise segmentation loss is weighted by
the joint structure loss, which becomes our structure-sensitive loss. Consequently,
the overall human parsing networks become self-supervised with structure-sensitive
loss.
Formally, given an image I, we define a list of joint configurations C_I^P = {c_i^p | i ∈ [1, N]},
where c_i^p is the heatmap of the ith joint computed according to the parsing result
map. Similarly, C_I^GT = {c_i^gt | i ∈ [1, N]} is obtained from the corresponding parsing
ground truth. Here, N is a variate decided by the human bodies in the input images and
is equal to 9 for a full-body image. For the joints missing from the image, we simply
replace the heatmaps with maps filled with zeros. The joint structure loss is the
Euclidean (L2) loss, calculated as

    L_{Joint} = \frac{1}{2N} \sum_{i=1}^{N} \| c_i^{p} - c_i^{gt} \|_2^2 ,    (5.1)
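As an illustrative sketch (not the authors' released implementation; the parsing label ids, region groupings, and the Gaussian width sigma below are assumptions made for this example), the joint generation and the loss of Eq. (5.1) can be written in NumPy as follows:

```python
import numpy as np

# Hypothetical mapping from parsing label ids to the 9 joint regions;
# e.g., "head" merges hat, hair, sunglasses, and face. The ids are illustrative.
JOINT_REGIONS = {
    "head": [1, 2, 3, 13], "upper_body": [4, 6, 10], "lower_body": [8, 11],
    "left_arm": [15], "right_arm": [14], "left_leg": [17], "right_leg": [16],
    "left_shoe": [19], "right_shoe": [18],
}

def joint_heatmaps(parsing, sigma=7.0):
    """Render one Gaussian heatmap per joint, centered on the mean pixel
    coordinate of the merged region; a zero map if the region is absent."""
    h, w = parsing.shape
    ys, xs = np.mgrid[0:h, 0:w]
    maps = []
    for labels in JOINT_REGIONS.values():
        mask = np.isin(parsing, labels)
        if not mask.any():
            maps.append(np.zeros((h, w)))
            continue
        cy, cx = np.argwhere(mask).mean(axis=0)
        maps.append(np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2)))
    return np.stack(maps)  # shape (9, h, w)

def joint_structure_loss(pred_parsing, gt_parsing):
    """Eq. (5.1): squared L2 distance between predicted and ground-truth
    joint heatmaps, averaged with the 1/(2N) factor."""
    cp, cg = joint_heatmaps(pred_parsing), joint_heatmaps(gt_parsing)
    n = cp.shape[0]
    return np.sum((cp - cg) ** 2) / (2 * n)
```

Identical parsing maps yield a loss of zero, while a spatial shift of a region moves its joint center and increases the loss.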
5.3 Self-supervised Structure-Sensitive Learning
Fig. 5.6 Some examples of self-supervised human joints generated from our parsing results for
different bodies
(a)
(b)
(c)
(d)
(e)
Fig. 5.7 Visualized comparison of human parsing results on the LIP validation set. a Upper body
images. b The back-view images. c The head-missing images. d The images with occlusion. e The
full-body images
The structure-sensitive loss is then obtained by weighting the segmentation loss with
the joint structure loss, i.e., L_{Structure} = L_{Joint} · L_{Parsing}, where L_{Parsing} is the
pixel-wise softmax loss calculated based on the parsing annotations.
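A minimal sketch of this weighting, with the joint structure loss passed in as a precomputed scalar (the function name and shapes are illustrative, not the released code):

```python
import numpy as np

def structure_sensitive_loss(logits, labels, joint_loss):
    """Pixel-wise softmax cross-entropy (L_Parsing) weighted by the scalar
    joint structure loss: L_Structure = L_Joint * L_Parsing.
    logits: (K, h, w) class scores; labels: (h, w) integer label map."""
    z = logits - logits.max(axis=0, keepdims=True)          # numerical stability
    log_prob = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    h, w = labels.shape
    parsing_loss = -log_prob[labels, np.arange(h)[:, None], np.arange(w)].mean()
    return joint_loss * parsing_loss
```

Because the weight is a scalar per image, gradients of the segmentation loss are simply scaled up for images whose predicted joint structure deviates from the ground truth.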
We refer to our learning framework as “self-supervised” as the abovementioned
structure-sensitive loss can be generated from the existing parsing results with no
additional information. Our self-supervised learning framework thus has excellent
adaptability and extensibility and can be injected into any advanced network to
incorporate rich high-level knowledge about human joints from a global perspective
(Fig. 5.7).
Table 5.2 Performance comparison in terms of per-class IoU with four state-of-the-art methods
on the LIP validation set
Method Hat Hair Gloves Sunglasses u-clothes Dress Coat Socks Pants Jumpsuit
SegNet [18] 26.60 44.01 0.01 0.00 34.46 0.00 15.97 3.59 33.56 0.01
FCN-8s [19] 39.79 58.96 5.32 3.08 49.08 12.36 26.82 15.66 49.41 6.48
DeepLabV2 [20] 57.94 66.11 28.50 18.40 60.94 23.17 47.03 34.51 64.00 22.38
Attention [15] 58.87 66.78 23.32 19.48 63.20 29.63 49.70 35.23 66.04 24.73
DeepLabV2 + SSL 58.41 66.22 28.76 20.05 62.26 21.18 48.17 36.12 65.16 22.94
Attention + SSL 59.75 67.25 28.95 21.57 65.30 29.49 51.92 38.52 68.02 24.48
Table 5.3 Comparison of person part segmentation performance with four state-of-the-art methods
on the PASCAL-Person-Part dataset [13]
Method                  head   torso  u-arms  l-arms  u-legs  l-legs  Bkg    Avg
DeepLab-LargeFOV [20]   78.09  54.02  37.29   36.85   33.73   29.61   92.85  51.78
HAZN [24] 80.79 59.11 43.05 42.76 38.99 34.46 93.59 56.11
Attention [15] 81.47 59.06 44.15 42.50 38.28 35.62 93.65 56.39
LG-LSTM [7] 82.72 60.99 45.40 47.76 42.33 37.96 88.63 57.97
Attention + SSL 83.26 62.40 47.80 45.58 42.32 39.48 94.68 59.36
Following [15, 24], the annotations are merged into six person part classes (head, torso,
upper/lower arms, and upper/lower legs) and one background class. The second is
our large-scale LIP dataset, which is highly challenging, with high pose complex-
ity, heavy occlusions, and body truncation, as introduced and analyzed in Sect. 5.2
(Tables 5.2 and 5.3).
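The per-class IoU metric reported in Tables 5.2 and 5.3 can be computed from a confusion matrix; a generic sketch:

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Intersection-over-union per class from integer label maps.
    Classes absent from both prediction and ground truth yield NaN."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)   # rows: gt, cols: pred
    inter = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - np.diag(conf)
    with np.errstate(invalid="ignore"):
        return inter / union
```

Averaging the finite entries of the returned vector gives the mean IoU used for the overall comparison.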
References
1. K. Gong, X. Liang, D. Zhang, X. Shen, L. Lin, Look into Person: Self-Supervised Structure-
Sensitive Learning and a New Benchmark for Human Parsing, in 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Honolulu, pp. 6757–6765 (2017)
2. H. Zhang, G. Kim, E.P. Xing, Dynamic topic modeling for monitoring market competition from
online text and image data, in ACM SIGKDD (ACM, 2015)
3. R. Zhao, W. Ouyang, X. Wang, Unsupervised salience learning for person re-identification, in
CVPR (2013)
4. C. Gan, M. Lin, Y. Yang, G. de Melo, A.G. Hauptmann, Concepts not alone: Exploring pairwise
relationships for zero-shot video activity recognition, in AAAI (2016)
5. X. Liang, Y. Wei, X. Wei, J. Wei, L. Lin, S. Yan, Proposal-free network for instance-level object
segmentation. arXiv preprint arXiv:1509.02636 (2015)
6. X. Liang, S. Liu, X. Shen, J. Yang, L. Liu, J. Dong, L. Lin, S. Yan, Deep human parsing with
active template regression, in TPAMI (2015)
7. X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, S. Yan, Semantic object parsing with local-global
long short-term memory, in CVPR (2016)
8. X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, S. Yan, Human parsing with
contextualized convolutional neural network, in ICCV (2015)
9. X. Liang, S. Liu, Y. Wei, L. Liu, L. Lin, S. Yan, Towards computational baby learning: a
weakly-supervised approach for object detection, in Proceedings of the IEEE International
Conference on Computer Vision, pp. 999–1007 (2015)
10. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. Torr,
Conditional random fields as recurrent neural networks, in ICCV (2015)
11. K. Yamaguchi, M.H. Kiapour, T.L. Berg, Paper doll parsing: retrieving similar styles to parse
clothing items, in ICCV (2013)
12. J. Dong, Q. Chen, W. Xia, Z. Huang, S. Yan, A deformable mixture parsing model with parselets,
in ICCV (2013)
13. X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, et al., Detect what you can: detecting and
representing objects using holistic models and body parts, in CVPR (2014)
14. K. Yamaguchi, M. Kiapour, L. Ortiz, T. Berg, Parsing clothing in fashion photographs, in CVPR
(2012)
15. L.C. Chen, Y. Yang, J. Wang, W. Xu, A.L. Yuille, Attention to scale: Scale-aware semantic
image segmentation, in CVPR (2016)
16. T. Lin, M. Maire, S.J. Belongie, L.D. Bourdev, R.B. Girshick, J. Hays, P. Perona, D. Ramanan,
P. Dollár, C.L. Zitnick, Microsoft COCO: common objects in context. CoRR abs/1405.0312
(2014)
17. P. Arbelaez, M. Maire, C. Fowlkes, J. Malik, Contour detection and hierarchical image
segmentation. TPAMI 33(5), 898–916 (2011)
18. V. Badrinarayanan, A. Kendall, R. Cipolla, SegNet: a deep convolutional encoder-decoder
architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015)
19. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation,
arXiv preprint arXiv:1411.4038 (2014)
20. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Semantic image segmentation
with deep convolutional nets and fully connected CRFs. arXiv:1412.7062 (2014)
21. W. Yang, W. Ouyang, H. Li, X. Wang, End-to-end learning of deformable mixture of parts and
deep convolutional neural networks for human pose estimation, in CVPR (2016)
22. X. Chen, A. Yuille, Articulated pose estimation by a graphical model with image dependent
pairwise relations, in NIPS (2014)
23. T. Pfister, J. Charles, A. Zisserman, Flowing convnets for human pose estimation in videos, in
ICCV (2015)
24. F. Xia, P. Wang, L.C. Chen, A.L. Yuille, Zoom better to see clearer: human part segmentation
with auto-zoom net, in ECCV (2016)
Chapter 6
Instance-Level Human Parsing
6.1 Introduction
Human parsing for recognizing each semantic part (e.g., arms, legs) is one of the
most fundamental and critical tasks in analyzing humans in the wild and plays an
important role in higher level application domains such as video surveillance [1] and
human behavior analysis [2, 3].
Driven by the advance of fully convolutional networks (FCNs) [4], human parsing,
or semantic part segmentation, has recently made great progress owing to deeply
learned features [5, 6], large-scale annotations [7, 8], and advanced reasoning over
graphical models [9, 10]. However, previous approaches focus only on the single-
person parsing task in simplified and limited conditions such as fashion pictures
[11–13] with upright poses and diverse daily images [7], and disregard real-world
cases in which multiple person instances appear in one image. Such ill-posed single-
person parsing tasks severely limit the potential application of human parsing to
more challenging scenarios (e.g., group behavior prediction).
In this work, we aim to resolve the more challenging instance-level human pars-
ing task, which needs to not only segment various body parts or clothes but also
associate each part with one instance, as shown in Fig. 6.1. In addition to the diffi-
Fig. 6.1 Examples of large-scale “Crowd Instance-level Human Parsing (CIHP)” dataset, which
contains 38,280 multiperson images with elaborate annotations and high appearance variability
as well as complexity. The images are presented in the first row. The annotations of semantic part
segmentation and instance-level human parsing are shown in the second and third rows, respectively.
Best viewed in color
Fig. 6.2 Two examples show that the errors of the parts and edges of challenging cases can be
seamlessly remedied by the refinement scheme in PGN. In the first row, the segmentation branch
fails to locate the small objects (e.g., the person at the top-left corner and the hand at the bottom-right
corner), but the edge branch detects them successfully. In the second row, the background edges
are mistakenly labeled. However, these incorrect results are rectified by the refinement branch of
the PGN
edge detection under a unified network that first learns shared representation and then
appends two parallel branches for semantic part segmentation and instance-aware
edge detection. As the two targets are highly correlated with each other by sharing
coherent grouping goals, PGN further incorporates a refinement branch to make the
two targets mutually benefit from each other by exploiting complementary contextual
information. This integrated refinement scheme is especially advantageous for chal-
lenging cases because it seamlessly remedies the errors from each target. As shown
in Fig. 6.2, a small person may fail to be localized by the segmentation branch but
may be successfully detected by the edge branch, or mistakenly labeled background
edges from instance boundaries could be corrected with the refinement algorithm.
Given semantic part segmentation and instance edges, an efficient cutting inference
can be used to generate instance-level human parsing results using a breadth-first
search over line segments obtained by jointly scanning the segmentation and edges
maps.
Furthermore, to the best of our knowledge, there is no available large-scale dataset
for instance-level human parsing research. We introduce a new large-scale dataset,
named Crowd Instance-level Human Parsing (CIHP), that contains 38,280 multiper-
son images with pixel-wise annotations of 19 semantic parts at the instance level. The
dataset is elaborately annotated, focusing on the semantic understanding of multiple
people in the wild, as shown in Fig. 6.1. With the new dataset, we also propose a public
server benchmark to automatically report evaluation results for fair comparison.
Our contributions are summarized as follows. (1) We investigate more challenging
instance-level human parsing, which pushes the research boundary of human parsing
to better match real-world scenarios. (2) A novel part grouping network (PGN) is
proposed to immediately solve multiperson human parsing in a unified network by
Human Parsing Recently, many research efforts have been devoted to human pars-
ing [7, 11, 22–24] to advance human-centric analysis. For example, Liang et al.
[24] proposed a novel Co-CNN architecture that integrates multiple levels of image
contexts into a unified network. Gong et al. [7] designed a structure-sensitive
learning method to enforce semantic consistency between the produced parsing results and the
human joint structures. However, all these prior works focus only on relatively sim-
ple single-person human parsing without considering the common multiple-instance
cases in the real world.
For the current data resources, we summarize the publicly available datasets for
human parsing in Table 6.1. Previous datasets include only very few person instances
and categories in one image, so prior works evaluated only pure part segmentation
performance while disregarding instance identities. In contrast, containing 38,280
images, the proposed CIHP dataset is the first and most comprehensive
dataset for instance-level human parsing to date. Although a few datasets exist in
the vision community that are dedicated to other tasks, e.g., clothing recognition and
retrieval [25, 26] and fashion modeling [27], our CIHP, which focuses on instance-
level human parsing, is the largest and provides more elaborate dense annotations
for diverse images. A standard server benchmark for our CIHP can facilitate human
analysis research by enabling a fair comparison among current approaches.
Instance-Level Object Segmentation Our target is also highly relevant to the
instance-level object segmentation task that aims to predict a whole mask for each
object in an image. Most of the prior works [15–19] addressed this task by
sequentially optimizing object detection and foreground/background segmentation.
Dai et al. [17] proposed a multiple-stage cascade to unify bounding box proposal
generation, segment proposal generation, and classification. In [14, 28], a CRF was
used to assign each pixel to an object detection box by exploiting semantic segmen-
tation maps. More recently, Mask R-CNN [19] extended the Faster R-CNN detection
framework [29] by adding a branch to predict the segmentation masks of each region
of interest. However, these proposal-based methods may fail to model the interac-
tions among different instances, which are critical for performing more fine-grained
segmentation for each instance in our instance-level human parsing.
Nonetheless, some approaches [3, 20, 21, 30–32] were also proposed to bypass
the object proposal step for instance-level segmentation. In the PFN [3], the clustering
of the number of instances and per-pixel bounding boxes was predicted to produce
instance segmentation.
6.2 Related Work
Table 6.1 Comparison of the publicly available datasets for human parsing. For each dataset, we
report the number of person instances per image; the total number of images; the separate numbers
of images in the training, validation, and test sets; and the number of part labels, including the
background
Dataset                   #Instances/image  #Total   #Train   #Validation  #Test   Categories
Fashionista [23]          1                 685      456      –            229     56
PASCAL-Person-Part [13]   2.2               3,533    1,716    –            1,817   7
ATR [5]                   1                 17,700   16,000   700          1,000   18
LIP [7]                   1                 50,462   30,462   10,000       10,000  20
CIHP                      3.4               38,280   28,280   5,000        5,000   20
In [21], semantic segmentation and object boundary
prediction were exploited to separate instances by a complicated image partitioning
formulation. Similarly, the SGN [20] proposed predicting object breakpoints to cre-
ate line segments, which were then grouped into connected components to generate
object regions. Despite their intuition being similar to ours in grouping regions to
generate an instance, these two pipelines separately learn several subnetworks and
thus obtain the final results by relying on a few independent steps.
Here, we emphasize that this work investigates a more challenging fine-grained
instance-level human parsing task that integrates the current semantic part segmenta-
tion and instance-level object segmentation tasks. From the technical perspective, we
present a novel detection-free part grouping network that unifies and mutually refines
the two twinned grouping tasks, semantic part segmentation and instance-aware edge
detection, in an end-to-end way. Without the expensive CRF refinement used in [14], the
final results can then be effortlessly obtained by a simple instance partition process.
dataset, we propose a new benchmark for instance-level human parsing together with
a standard evaluation server, where the test set will be kept secret to avoid overfitting.
The images in the CIHP are collected from unconstrained resources such as Google
and Bing. We manually specify several keywords (e.g., family, couple, party, meeting)
to gain a great diversity of multiperson images. The crawled images are elaborately
annotated by a professional labeling organization with good quality control. We
supervise the entire annotation process and conduct a second-round check for each
annotated image. We remove the unusable images that are of low resolution or image
quality or contain one or no person instances.
In total, 38,280 images are kept to construct the CIHP dataset. Following random
selection, we arrive at a unique split that consists of 28,280 training and 5,000 vali-
dation images with publicly available annotations as well as 5,000 test images with
annotations withheld for benchmarking purposes.
We now introduce the images and categories in the CIHP dataset with more statistical
details. Superior to the previous attempts [7, 13, 24], which average one or two person
instances per image, all images of the CIHP dataset contain two or more instances,
with an average of 3.4. The distribution of the number of persons per image is
illustrated in Fig. 6.3 (left).
Fig. 6.3 Left: Statistics on the number of persons in one image. Right: The data distribution of the
19 semantic part labels in the CIHP dataset
6.3 Crowd Instance-Level Human Parsing Dataset
Generally, we follow LIP [7] to define and annotate the semantic part labels.
However, we find that the jumpsuit label defined in LIP [7]
is infrequent compared to the other labels. For more complete and precise human
parsing, we use a more common body part label (torso-skin) instead. Therefore, the
19 semantic part labels in the CIHP are hat, hair, sunglasses, upper clothing, dress,
coat, socks, pants, gloves, scarf, skirt, torso-skin, face, right/left arm, right/left leg,
and right/left shoe. The numbers of images for each semantic part label are presented
in Fig. 6.3 (right).
In this section, we present a general pipeline for our approach (see Fig. 6.4) and
then describe each component in detail. The proposed part grouping network (PGN)
jointly trains and refines the semantic part segmentation and instance-aware edge
detection in a unified network. Technically, these two subtasks are both pixel-wise
classification problems, on which fully convolutional networks (FCNs) [4] perform
well. Our PGN is thus constructed based on the FCN structure, which first learns
common representation using shared intermediate layers and then appends two par-
allel branches for semantic part segmentation and edge detection. To explore and
take advantage of the semantic correlation of these two tasks, a refinement branch
is further incorporated to make the two targets mutually beneficial for each other
by exploiting complementary contextual information. Finally, an efficient partition
process with a heuristic grouping algorithm can be used to generate instance-level
Fig. 6.4 Illustration of our part grouping network (PGN). Given an input image, we use ResNet-
101 to extract the shared feature maps. Then, two branches are appended to capture part context and
human boundary context while simultaneously generating part score maps and edge score maps.
Finally, a refinement branch is performed to refine both predicted segmentation maps and edge
maps by integrating part segmentation and human boundary contexts
Fig. 6.5 The whole pipeline of our approach to instance-level human parsing. Generated from the
PGN, the part segmentation maps and edge maps are scanned simultaneously to create horizontal
and vertical segmented lines. Similar to a connected graph problem, the breadth-first search can
be applied to group the segmented lines into regions. Furthermore, the small regions near the
instance boundary are merged into their neighbor regions to cover larger areas and several part
labels. Associating the instance maps and part segmentation maps, the pipeline finally outputs a
well-predicted instance-level human parsing result without any proposals from object detection
human parsing results using a breadth-first search over line segments obtained by
jointly scanning the generated semantic part segmentation maps and instance-aware
edge maps.
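A toy version of this edge-guided grouping can be sketched as follows. The actual partition process described above operates on horizontal and vertical line segments and then merges small boundary regions, whereas this simplified sketch runs the breadth-first search directly over pixels, treating edge pixels as barriers:

```python
import numpy as np
from collections import deque

def group_instances(part_seg, edge_map):
    """Assign an instance id to every foreground pixel (part_seg > 0) by
    breadth-first search over 4-connected neighbors, never crossing edges.
    part_seg: (h, w) integer part-label map; edge_map: (h, w) boolean."""
    h, w = part_seg.shape
    inst = np.zeros((h, w), dtype=np.int32)
    next_id = 0
    for sy in range(h):
        for sx in range(w):
            if part_seg[sy, sx] == 0 or edge_map[sy, sx] or inst[sy, sx]:
                continue
            next_id += 1                       # start a new instance region
            inst[sy, sx] = next_id
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and inst[ny, nx] == 0
                            and part_seg[ny, nx] > 0 and not edge_map[ny, nx]):
                        inst[ny, nx] = next_id
                        queue.append((ny, nx))
    return inst
```

Two people touching in the image stay separate as long as the detected instance-aware edge forms a barrier between their regions; the per-pixel part labels are then read off within each instance id.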
First, combining semantic information from a deep, coarse layer with appearance
information from a shallow, fine layer helps produce accurate and detailed segmentation,
so we concatenate the activations of the final three blocks of ResNet-101 as the final
extracted feature maps. Owing to the atrous convolution, this information combi-
nation allows the network to make local predictions instructed by global structure
without upscale operation. Second, following PSPNet [33], which exploits the capa-
bility of global context information by different region-based context aggregation,
we use the pyramid pooling module on top of the extracted feature maps before the
final classification layers. The extracted feature maps are average-pooled with four
different kernel sizes, giving us four feature maps with spatial resolutions of 1 × 1,
2 × 2, 3 × 3, and 6 × 6. Each feature map undergoes convolution and upsampling
before they are concatenated with each other. Benefiting from these two coarse-to-
fine schemes, the backbone subnetwork is able to capture contextual information that
has different scales and varies among different subregions.
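The pooling pattern described above can be sketched as follows. This is a simplified stand-in for the actual module: nearest-neighbor upsampling replaces bilinear interpolation, and the per-branch 1 × 1 convolutions are omitted:

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Average-pool a (C, H, W) feature map onto a bins x bins grid."""
    c, h, w = feat.shape
    out = np.empty((c, bins, bins))
    for i in range(bins):
        for j in range(bins):
            ys, ye = i * h // bins, max((i + 1) * h // bins, i * h // bins + 1)
            xs, xe = j * w // bins, max((j + 1) * w // bins, j * w // bins + 1)
            out[:, i, j] = feat[:, ys:ye, xs:xe].mean(axis=(1, 2))
    return out

def pyramid_pooling(feat, grids=(1, 2, 3, 6)):
    """Concatenate the input with pooled-and-upsampled variants at the
    1x1, 2x2, 3x3, and 6x6 spatial resolutions used in the chapter."""
    c, h, w = feat.shape
    pieces = [feat]
    for g in grids:
        pooled = adaptive_avg_pool(feat, g)
        # nearest-neighbor upsample back to (h, w)
        up = pooled[:, np.arange(h) * g // h, :][:, :, np.arange(w) * g // w]
        pieces.append(up)
    return np.concatenate(pieces, axis=0)  # (C * (1 + len(grids)), H, W)
```

The 1 × 1 branch contributes a global average per channel, while the finer grids preserve coarse spatial layout, which is why the concatenated output mixes context at several scales.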
Semantic Part Segmentation Branch. The common technique [10, 34] for
semantic segmentation is to predict the image at several different scales with shared
network weights and then combine the predictions together with the learned attention
weights. To reinforce the efficiency and generalizability of our unified network, we
discard the multiscale input and apply another context aggregation pattern with var-
ious average-pooling kernel sizes, which is introduced in [33]. We append one side
branch to perform pixel-wise recognition for assigning each pixel to one semantic
part label. The 1 × 1 convolutional classifiers output K channels, corresponding to
the number of target part labels, including a background class.
Instance-Aware Edge Detection Branch. Following [35], we attach side out-
puts for edge detection to the final three blocks of ResNet-101. Deep supervision is
imposed at each side-output layer to learn rich hierarchical representations of edge
predictions. In particular, we use atrous spatial pyramid pooling (ASPP) [10] for the
three edge side output layers to robustly detect boundaries at multiple scales. The
ASPP that we use consists of one 1 × 1 convolution and four 3 × 3 atrous convolu-
tions with dilation rates of 2, 4, 8, and 16. In the final classification layers for edge
detection, we use a pyramid pooling module to collect more global information for
better reasoning. We apply 1 × 1 convolutional layers with one channel for all edge
outputs to generate edge score maps.
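A single-channel sketch of this multirate aggregation is given below. Real ASPP branches are learned convolutions over many channels whose outputs are fused; here the kernels are toy inputs and the branch outputs are simply summed:

```python
import numpy as np

def dilated_conv3x3(feat, kernel, rate):
    """Single-channel 3x3 atrous convolution with zero padding.
    A dilation rate r samples the 3x3 taps r pixels apart."""
    h, w = feat.shape
    pad = rate
    padded = np.pad(feat, pad)
    out = np.zeros((h, w))
    for ky in range(3):
        for kx in range(3):
            dy, dx = (ky - 1) * rate, (kx - 1) * rate
            out += kernel[ky, kx] * padded[pad + dy:pad + dy + h,
                                           pad + dx:pad + dx + w]
    return out

def aspp(feat, kernels, rates=(2, 4, 8, 16)):
    """Fuse a 1x1 branch with 3x3 atrous branches at the chapter's
    dilation rates 2, 4, 8, and 16 (toy weights, single channel)."""
    out = kernels["1x1"] * feat
    for r in rates:
        out = out + dilated_conv3x3(feat, kernels[r], r)
    return out
```

Larger dilation rates enlarge the receptive field without adding parameters, which is what lets the edge branch pick up boundaries of both small and large person instances.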
Refinement Branch. We design a simple yet efficient refinement branch to jointly
refine segmentation and edge predictions. As shown in Fig. 6.4, the refinement branch
integrates the segmentation and edge predictions back into the feature space by
mapping them to a larger number of channels with an additional 1 × 1 convolution.
The remapped feature maps are combined with the extracted feature maps from both
the segmentation branch and edge branch, which are finally fed into another two
pyramid pooling modules to mutually boost segmentation and edge results.
In summary, the learning objective of the PGN can be written as

L = α · (L_seg + L′_seg) + β · (L_edge + L′_edge + Σ_{n=1}^{N} L_side^n).   (6.1)
The resolution of the output score maps is m × m, which is the same for both segmentation
and edge. Thus, the segmentation branch has a K·m²-dimensional output,
which encodes K segmentation maps of resolution m × m, one for each of the K
classes. During training, we apply a per-pixel softmax and define L_seg as the multinomial
cross-entropy loss. L′_seg is the same but for the refined segmentation results. For
each m²-dimensional edge output, we use a per-pixel sigmoid binary cross-entropy
loss. L_edge, L′_edge, and L_side^n denote the losses of the first predicted edge, the refined edge,
and the side-output edges, respectively. In our network, the number of edge side outputs,
N, is 3. α and β are the balancing weights.
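The loss in Eq. (6.1) can be assembled as follows. This is an illustrative NumPy sketch, not the authors' implementation; `alpha` and `beta` correspond to the balance weights above.

```python
import numpy as np

def softmax_xent(logits, labels):
    """Per-pixel multinomial cross-entropy. logits: (K, H, W), labels: (H, W) ints."""
    z = logits - logits.max(axis=0, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    h, w = labels.shape
    return -logp[labels, np.arange(h)[:, None], np.arange(w)].mean()

def sigmoid_bce(logits, targets):
    """Numerically stable per-pixel binary cross-entropy for edge maps."""
    return np.mean(np.maximum(logits, 0) - logits * targets +
                   np.log1p(np.exp(-np.abs(logits))))

def pgn_loss(seg, seg_ref, edge, edge_ref, edge_sides, seg_gt, edge_gt,
             alpha=1.0, beta=0.01):
    """Eq. (6.1): alpha*(L_seg + L'_seg) + beta*(L_edge + L'_edge + side losses)."""
    l_seg = softmax_xent(seg, seg_gt) + softmax_xent(seg_ref, seg_gt)
    l_edge = sigmoid_bce(edge, edge_gt) + sigmoid_bce(edge_ref, edge_gt)
    l_edge += sum(sigmoid_bce(s, edge_gt) for s in edge_sides)
    return alpha * l_seg + beta * l_edge
```

With N = 3 side outputs, `edge_sides` is a list of three score maps, matching the summation in Eq. (6.1).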
We use the batch normalization parameters provided by [10], which are fixed
during our training process. Our modules (including the ASPP and pyramid pooling
module) added to ResNet eliminate batch normalization because the whole network
is trained end-to-end with a small batch size due to the limitation of physical memory
on GPU cards. The ReLU activation function is applied following each convolutional
layer except the final classification layers.
Because the tasks of semantic part segmentation and instance-aware edge detection
are able to incorporate all the information required for instance-level human parsing,
we employ a simple instance partition process to obtain the final results during
inference, which groups human parts into instances based on edge guidance. The
entire process is illustrated in Fig. 6.5.
First, inspired by the line decoding process in [20], we simultaneously scan part
segmentation maps and edge maps thinned by nonmaximal suppression [35] to create
horizontal and vertical line segments. To create horizontal lines, we slide from left to
right along each row. The background positions of the segmentation maps are directly
skipped, and a new line starts when we hit a foreground label of segmentation. The
lines are terminated when we hit an edge point, and a new line should start at the
next position. We label each new line with an individual number, so the edge points
can cut off the lines and produce a boundary between two different instances. We
perform similar operations but slide from top to bottom to create vertical lines.
The next step is to aggregate these two types of lines to create instances. We
can treat the horizontal lines and vertical lines jointly as a connected graph. The
points in the same lines can be thought of as connected because they have the same
labeled number. We traverse the connected graph by the breadth-first search to find
connected components. In detail, when visiting a point, we search its connected
neighbors horizontally and vertically and then push them into the queue that stores
the points belonging to the same regions. As a result, the lines of the same instance
are grouped, and different instance regions are separated.
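The line-scanning and breadth-first grouping steps above can be sketched as follows. This is a simplified illustration: edge pixels themselves are left unassigned, and `seg` is assumed to be an integer part-label map with 0 as background.

```python
from collections import deque
import numpy as np

def scan_lines(seg, edge, axis):
    """Label maximal runs of foreground pixels along one axis, breaking at edges."""
    lab = np.zeros(seg.shape, dtype=int)
    nxt = 1
    it = seg if axis == 0 else seg.T
    ed = edge if axis == 0 else edge.T
    lb = lab if axis == 0 else lab.T          # transposed view writes through
    for r in range(it.shape[0]):
        cur = 0
        for c in range(it.shape[1]):
            if it[r, c] == 0 or ed[r, c]:     # background or edge closes the line
                cur = 0
            else:
                if cur == 0:
                    cur = nxt
                    nxt += 1
                lb[r, c] = cur
    return lab

def partition_instances(seg, edge):
    """Group foreground pixels into instances via BFS over h/v line segments."""
    h_id = scan_lines(seg, edge, axis=0)      # horizontal runs
    v_id = scan_lines(seg, edge, axis=1)      # vertical runs
    inst = np.zeros(seg.shape, dtype=int)
    n = 0
    H, W = seg.shape
    for sr in range(H):
        for sc in range(W):
            if inst[sr, sc] or (h_id[sr, sc] == 0 and v_id[sr, sc] == 0):
                continue
            n += 1
            inst[sr, sc] = n
            q = deque([(sr, sc)])
            while q:
                r, c = q.popleft()
                for rr, cc in ((r, c - 1), (r, c + 1), (r - 1, c), (r + 1, c)):
                    if not (0 <= rr < H and 0 <= cc < W) or inst[rr, cc]:
                        continue
                    same_h = h_id[r, c] and h_id[r, c] == h_id[rr, cc]
                    same_v = v_id[r, c] and v_id[r, c] == v_id[rr, cc]
                    if same_h or same_v:
                        inst[rr, cc] = n
                        q.append((rr, cc))
    return inst
```

Pixels on the same horizontal or vertical line end up in the same connected component, while edge points act as separators between adjacent instances.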
This simple process inevitably introduces errors if there are false edge points
within instances, resulting in many small regions in the area around instance boundaries.
We further design a grouping algorithm to address this issue. In rethinking the
separated regions, if a region contains several semantic part labels and covers a large
area, it must be a person instance. In contrast, if a region is small and contains only
one part segmentation label, we can certainly judge it to be an erroneously separated
region and then merge it with its neighbor instance region. We treat a region as a
person instance if it contains at least two part labels and covers an area over 30 pixels,
which works best in our experiments.
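The region-rethinking rule can be sketched as below. One detail is our assumption: the chapter only says an erroneous region is merged with "its neighbor instance region", so this sketch folds it into the neighbor sharing the longest boundary.

```python
import numpy as np

def merge_small_regions(inst, seg, min_area=30, min_parts=2):
    """Merge regions that fail the person-instance test into an adjacent region."""
    out = inst.copy()
    for i in [v for v in np.unique(out) if v != 0]:
        mask = out == i
        parts = np.unique(seg[mask])
        parts = parts[parts != 0]
        if mask.sum() >= min_area and len(parts) >= min_parts:
            continue  # large enough and multi-part: a genuine person instance
        neigh = {}    # neighboring instance id -> shared-boundary pixel count
        for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            shifted = np.roll(mask, (dr, dc), axis=(0, 1))
            # zero out wrapped-around rows/cols so roll acts like a plain shift
            if dr: shifted[0 if dr > 0 else -1, :] = False
            if dc: shifted[:, 0 if dc > 0 else -1] = False
            for j in np.unique(out[shifted & ~mask]):
                if j not in (0, i):
                    neigh[j] = neigh.get(j, 0) + int((out[shifted & ~mask] == j).sum())
        if neigh:
            out[mask] = max(neigh, key=neigh.get)
    return out
```

The 30-pixel area threshold and the two-part-label requirement mirror the criteria stated in the text.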
Following this instance partition process, person instance maps can be generated
directly from semantic part segmentation and instance-aware edge maps.
6.5 Experiments
Training Details: We use the basic structure and network settings provided by
DeepLab-v2 [10]. The 512 × 512 inputs are randomly cropped from the images
during training. The size of the output score maps, m, equals 64, with a downsampling
scale of 1/8. The number of categories, K , is 7 for the PASCAL-Person-Part
dataset [13] and 20 for our CIHP dataset.
The initial learning rate is 0.0001, the parsing loss weight α is 1, and the edge
loss weight β is 0.01. Following [36], we employ a “poly” learning rate policy in
which the initial learning rate is multiplied by (1 − iter/max_iter)^power with power = 0.9.
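The "poly" policy is a one-liner; the defaults below mirror the hyperparameters given in the text.

```python
def poly_lr(base_lr, it, max_iter, power=0.9):
    """'Poly' learning rate policy: base_lr * (1 - iter/max_iter) ** power."""
    return base_lr * (1.0 - it / max_iter) ** power
```

With base_lr = 0.0001 (the initial learning rate above), the rate decays smoothly from the initial value to zero at max_iter.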
We train all models with a batch size of 4 images and momentum of 0.9.
We apply data augmentation, including randomly scaling the input images (from
0.5 to 2.0), randomly cropping and randomly left-right flipping during training for
all datasets. As reported in [14], the baseline methods Holistic [14] and MNC [17]
are pretrained on the Pascal VOC Dataset [37]. For fair comparisons, we train the
PGN with the same settings for roughly 80 epochs.
Our method is implemented by extending the TensorFlow framework. All net-
works are trained on four NVIDIA GeForce GTX 1080 GPUs.
Inference: During testing, the resolution of every input is consistent with that of
the original image. We average the predictions produced by the part segmentation
branch and the refinement branch as the final results for part segmentation. For
edge detection, we use only the results of the refinement branch. To stabilize the
predictions, we perform inference by combining the results of the multiscale inputs
and left-right flipped images. In particular, the scale is 0.5 to 1.75 in increments
of 0.25 for segmentation and from 1.0 to 1.75 for edge detection. In the partition
process, we break the lines when the activation of the edge point is larger than 0.2.
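The multiscale, flip-averaged inference described above can be sketched as follows. Nearest-neighbor resizing stands in for the interpolation a real pipeline would use, and `predict` is a placeholder for the network forward pass that returns a (K, H, W) score map.

```python
import numpy as np

def resize_nn(x, h, w):
    """Nearest-neighbor resize of a (..., H, W) array to (..., h, w)."""
    rows = (np.arange(h) * x.shape[-2]) // h
    cols = (np.arange(w) * x.shape[-1]) // w
    return x[..., rows, :][..., :, cols]

def multiscale_flip_scores(img, predict, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    """Average (K, H, W) score maps over scaled and left-right flipped inputs."""
    h, w = img.shape[-2:]
    acc = None
    for s in scales:
        sh, sw = max(1, int(h * s)), max(1, int(w * s))
        for flip in (False, True):
            x = resize_nn(img, sh, sw)
            if flip:
                x = x[..., ::-1]
            score = predict(x)
            if flip:
                score = score[..., ::-1]   # flip scores back before averaging
            score = resize_nn(score, h, w)
            acc = score if acc is None else acc + score
    return acc / (2 * len(scales))
```

The segmentation scales above match the 0.5 to 1.75 range in increments of 0.25 stated in the text; the edge branch would use the narrower 1.0 to 1.75 range.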
Evaluation Metric: The standard intersection over union (IoU) criterion is
adopted for evaluation of semantic part segmentation, following [13]. To evaluate
instance-aware edge detection performance, we use the same measures for traditional
edge detection [38]: fixed contour threshold (ODS) and per-image best threshold (OIS).
Table 6.3 Comparison of AP^r at various IoU thresholds for instance-level human parsing on the
PASCAL-Person-Part dataset [13]

Method                           IoU 0.5   IoU 0.6   IoU 0.7   AP^r_vol
MNC [17]                         38.8      28.1      19.3      36.7
Holistic [14]                    40.6      30.4      19.1      38.4
PGN (edge + segmentation)        36.2      25.9      16.3      35.6
PGN (w/o refinement)             39.1      29.3      19.5      37.8
PGN (w/o grouping)               37.1      28.2      19.3      38.2
PGN (large-area grouping)        37.6      28.7      19.7      38.6
PGN                              39.6      29.9      20.0      39.2
Table 6.4 Performance comparison of edges (left), part segmentation (middle), and instance-level
human parsing (right) from different components of the PGN on the CIHP dataset

Method                            ODS    OIS    Mean IoU   IoU 0.5   IoU 0.6   IoU 0.7   AP^r_vol
PGN (edge) + PGN (segmentation)   44.8   44.9   50.7       28.5      22.9      16.4      27.8
PGN (w/o refinement)              45.3   45.6   54.1       33.3      26.3      18.5      31.4
PGN (w/o grouping)                –      –      –          34.7      27.8      20.1      32.9
PGN (large-area grouping)         –      –      –          35.1      28.2      20.4      33.4
PGN                               45.5   46.0   55.8       35.8      28.6      20.5      33.6
As there is no available code for the baseline method [14], we extensively evaluate
each component of our PGN architecture on the CIHP test set, as shown in Table 6.4.
For part segmentation and instance-level human parsing, the performances on CIHP
are worse than those on PASCAL-Person-Part [13] because the CIHP dataset contains
more instances with more diverse poses, appearance patterns, and occlusions, which
is more consistent with real-world scenarios, as shown in Fig. 6.6. However, the
images in CIHP are high quality with higher resolution, which leads to better edge
detection results.
The qualitative results on the PASCAL-Person-Part dataset [13] and the CIHP dataset
are shown in Fig. 6.6. Compared to the results of Holistic [14], our part segmentation
and instance-level human parsing results are more precise because the predicted edges
can eliminate the interference from the background, such as the flag in group (a) and
the dog in group (b). Overall, our PGN outputs highly semantically meaningful
predictions owing to the mutual refinement of edge detection and semantic part
segmentation.
Fig. 6.6 Left: Visualized results on the PASCAL-Person-Part dataset [13]. In each group, the first
row shows the input image and the segmentation and instance results of Holistic [14] (provided by
the authors); the results of our PGN are presented in the second row. Right: The images and the
predicted results of edges, segmentation, and instance-level human parsing by our PGN on the CIHP
dataset are presented vertically
References
1. L. Wang, X. Ji, Q. Deng, M. Jia, Deformable part model based multiple pedestrian detection
for video surveillance in crowded scenes, in VISAPP (2014)
2. K. Gong, X. Liang, D. Zhang, X. Shen, L. Lin, Look into person: Self-supervised structure-
sensitive learning and a new benchmark for human parsing, in CVPR (2017)
3. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in CVPR (2016)
4. Q. Li, A. Arnab, P.H. Torr, Holistic, instance-level human parsing. arXiv preprint
arXiv:1709.03612 (2017)
5. B. Hariharan, P. Arbeláez, R. Girshick, J. Malik, Simultaneous detection and segmentation, in
ECCV (2014)
6. X. Liang, Y. Wei, X. Shen, Z. Jie, J. Feng, L. Lin, S. Yan, Reversible recursive instance-level
object segmentation, in CVPR (2016)
7. J. Dai, K. He, J. Sun, Instance-aware semantic segmentation via multi-task network cascades,
in CVPR (2016)
8. P.O. Pinheiro, R. Collobert, P. Dollár, Learning to segment object candidates, in NIPS (2015)
9. K. He, G. Gkioxari, P. Dollar, R. Girshick, Mask r-cnn, in ICCV (2017)
10. S. Liu, J. Jia, S. Fidler, R. Urtasun, Sgn: Sequential grouping networks for instance segmenta-
tion, in ICCV (2017)
11. A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, C. Rother, Instancecut: from edges to
instances with multicut, in CVPR (2017)
12. E. Simo-Serra, S. Fidler, F. Moreno-Noguer, R. Urtasun, A High Performance CRF Model for
Clothes Parsing, in ACCV (2014)
13. Z. Liu, P. Luo, S. Qiu, X. Wang, X. Tang, Deepfashion: Powering robust clothes recognition
and retrieval with rich annotations, in CVPR (2016)
14. M. Hadi Kiapour, X. Han, S. Lazebnik, A.C. Berg, T.L. Berg, Where to buy it: Matching
street clothing photos in online shops, in ICCV (2015)
15. E. Simo-Serra, S. Fidler, F. Moreno-Noguer, R. Urtasun, Neuroaesthetics in fashion: modeling
the perception of fashionability, in CVPR (2015)
16. A. Arnab, P.H.S. Torr, Pixelwise instance segmentation with a dynamically instantiated net-
work, in CVPR (2017)
17. S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region
proposal networks, in NIPS (2015)
18. M. Ren, R.S. Zemel, End-to-end instance segmentation with recurrent attention, in CVPR
(2017)
19. M. Bai, R. Urtasun, Deep watershed transform for instance segmentation, in CVPR (2017)
20. B. Romera-Paredes, P.H.S. Torr, Recurrent instance segmentation, in ECCV (2016)
21. H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, in CVPR (2017)
22. M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The pascal visual object
classes (voc) challenge, in IJCV (2010)
23. S. Xie, Z. Tu, Holistically-nested edge detection, in ICCV (2015)
24. L.C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic
image segmentation. arXiv preprint arXiv:1706.05587 (2017)
25. Y. Liu, M.M. Cheng, X. Hu, K. Wang, X. Bai, Richer convolutional features for edge detection,
in CVPR (2017)
26. J. Yang, B. Price, S. Cohen, H. Lee, M.H. Yang, Object contour detection with a fully convo-
lutional encoder-decoder network, in CVPR (2016)
27. C. Gan, M. Lin, Y. Yang, G. de Melo, A.G. Hauptmann, Concepts not alone: Exploring pairwise
relationships for zero-shot video activity recognition, in AAAI (2016)
28. X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, S. Yan, Proposal-free network for instance-level
object segmentation. arXiv preprint arXiv:1509.02636 (2015)
29. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation,
arXiv preprint arXiv:1411.4038 (2014)
30. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, Imagenet large scale visual recognition challenge
(2015)
31. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. Torr,
Conditional random fields as recurrent neural networks, in ICCV (2015)
32. L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Deeplab: Semantic image
segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv
preprint arXiv:1606.00915 (2016)
33. K. Yamaguchi, M. Kiapour, T. Berg, Paper doll parsing: Retrieving similar styles to parse
clothing items, in ICCV (2013)
34. J. Dong, Q. Chen, W. Xia, Z. Huang, S. Yan, A deformable mixture parsing model with parselets,
in ICCV (2013)
35. X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, et al., Detect what you can: Detecting and
representing objects using holistic models and body parts, in CVPR (2014)
36. T. Lin, M. Maire, S.J. Belongie, L.D. Bourdev, R.B. Girshick, J. Hays, P. Perona, D. Ramanan,
P. Dollár, C.L. Zitnick, Microsoft COCO: common objects in context. CoRR arXiv:1405.0312 (2014)
37. K. Yamaguchi, M. Kiapour, L. Ortiz, T. Berg, Parsing clothing in fashion photographs, in CVPR
(2012)
38. X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, S. Yan, Human parsing with
contextualized convolutional neural network, in ICCV (2015)
39. L.C. Chen, Y. Yang, J. Wang, W. Xu, A.L. Yuille, Attention to scale: Scale-aware semantic
image segmentation, in CVPR (2016)
40. F. Xia, P. Wang, L.C. Chen, A.L. Yuille, Zoom better to see clearer: Human part segmentation
with auto zoom net, in ECCV (2016)
Chapter 7
Video Instance-Level Human Parsing
7.1 Introduction
Due to the successful development of fully convolutional networks (FCNs) [1], great
progress has been made in human parsing, or the semantic part segmentation task [2–
8]. However, previous approaches to single-person or multiple-person human parsing
focused only on the static image domain. To bring the research closer to real-world
scenarios, fast and accurate video instance-level human parsing is more desirable
and crucial for high-level applications such as action recognition and object tracking
as well as group behavior prediction.
In this work, we make the first attempt to investigate the more challenging video
instance-level human parsing task, which needs to not only segment various body
parts or clothing but also associate each part with one instance for every frame in the
video, as shown in Fig. 7.1. In addition to the difficulties shared with single-person
parsing (e.g., various appearances, viewpoints, and self-occlusions) and instance-
level parsing (e.g., an uncertain number of instances), video human parsing faces
more challenges that are inevitable in video object detection and segmentation problems.
7.2 Video Instance-Level Parsing Dataset
In this section, we describe our new video instance-level parsing (VIP) dataset in
detail. Sample frames of some of the sequences are shown in Fig. 7.1. To the best of
our knowledge, our VIP is the first large-scale dataset that focuses on comprehensive
human understanding to benchmark the new challenging video instance-level fine-
grained human parsing task. Containing videos collected from real-world scenarios
in which people appear in various poses, from various viewpoints and with heavy
occlusions, the VIP dataset presents the difficulties of the semantic part segmentation
task. Furthermore, it also includes all major challenges typically found in longer video
sequences such as motion blur, camera shake, out-of-view, and scale variation.
Fig. 7.1 Sample sequences from our VIP dataset with ground-truth part segmentation masks over-
laid
Our data collection and annotation methodology are carefully designed to capture
the high variability of real-world human activity scenes. The sequences are collected
from YouTube with several specified keywords (e.g., dancing, flash mob) to gain
a wide variety of multiperson videos. All images are meticulously annotated by
professionals. We maintain data quality by manually inspecting and conducting a
second-round check of the annotated data. We remove the unusable images that are
of low resolution and image quality. The length of a video in the dataset ranges from
10 s to 120 s. For every 25 consecutive frames in each video, one frame is densely
annotated with pixel-wise semantic part categories and instance-level identification.
To analyze every detailed region of a person, including different body parts as well
as different clothing styles, following the largest still-image human parsing dataset,
LIP [8], we defined 19 usual clothing and body-part classes (hat, hair, sunglasses,
upper clothing, dress, coat, socks, pants, gloves, scarf, skirt, torso-skin, face, right/left
arm, right/left leg, and right/left shoe) for annotation. Additionally, the annotated
frames of our VIP dataset, with an average of 2.93 person instances per image, are
superior to the previous attempts [3, 8, 9], which average one or two person instances
per image.
Fig. 7.2 An overview of our ATEN approach, which performs adaptive temporal encoding over
key frames and flow-guided feature propagation for consecutive frames among key frames. Each
key frame (blue) is fed into a temporal encoding module that memorizes the temporal information
of its former key frames. To alleviate the computational cost, the features of consecutive frames
(green) between two key frames can be produced by the flow-guided propagation module from the
nearest key frame. Then, all feature maps of all frames are fed to the Parsing-RCNN to generate
the instance-level human parsing results
As shown in Fig. 7.2, our ATEN approach based on the Parsing-RCNN aims to
balance efficiency and accuracy by applying flow-guided feature propagation and
adaptive temporal encoding. We divide each video sequence into several segments
of equal length l, Seg_j = [I_{jl}, I_{jl+1}, ..., I_{(j+1)l−1}]. Only one frame in each segment
is selected to be a key frame (the median frame by default). Given a key
frame I_k, the encoded feature is denoted by F̄_k.
Subsequently, the feature of a non-key frame I_t is propagated from the nearest key
frame I_k as F_t = S ⊙ W(F̄_k, M),
where M and S are the flow field and scale field, respectively. Finally, the instance-level
human parsing subnetwork N_parse is applied to both the encoded key-frame feature
maps and the warped non-key-frame feature maps to compute the final result R =
N_parse(F).
As shown in Fig. 7.3, given an encoding range p, which specifies the range of
the former key frames for encoding ( p = 2 by default), we first apply the embedded
FlowNet F [10] to individually estimate p flow fields and scale fields, which are used
for warping p former key frames to the current key frame (as illustrated in Sect. 7.3.1).
Fig. 7.3 Our adaptive temporal encoding module. For each key frame K, we first obtain warped
feature maps from the two previous key frames (i.e., K − 1 and K − 2) via the flow-guided propagation
module. Then, the warped features and the current appearance features are consecutively fed to
convGRU for temporal encoding. All feature maps in this module have the same shape (stride of 4,
256 dimensions)
After feature warping, each warped feature is consecutively fed to convGRU for
temporal coherence feature encoding. We use the last state of GRU as the encoded
feature.
F̄_k = convGRU(F_{k−p→k}, ..., F_{k−1→k}, F_k).   (7.4)
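Equation (7.4) folds the warped key-frame features and the current feature into one encoded map with a convolutional GRU. The sketch below replaces spatial convolutions with per-pixel (1 × 1) linear maps and omits biases, so it illustrates only the gating recurrence, not the actual module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ConvGRUCell:
    """Minimal convGRU with 1x1 'convolutions' (per-pixel linear maps)."""
    def __init__(self, channels, seed=0):
        rng = np.random.default_rng(seed)
        k = lambda: rng.normal(0, 0.1, (channels, 2 * channels))
        self.wz, self.wr, self.wh = k(), k(), k()

    def step(self, h, x):
        def conv1x1(w, a, b):
            cat = np.concatenate([a, b], axis=0)      # (2C, H, W)
            return np.einsum('oc,chw->ohw', w, cat)   # per-pixel linear map
        z = sigmoid(conv1x1(self.wz, h, x))           # update gate
        r = sigmoid(conv1x1(self.wr, h, x))           # reset gate
        h_tilde = np.tanh(conv1x1(self.wh, r * h, x)) # candidate state
        return (1 - z) * h + z * h_tilde

def conv_gru_encode(features):
    """Fold [F_{k-p->k}, ..., F_{k-1->k}, F_k] into one encoded map F̄_k."""
    cell = ConvGRUCell(features[0].shape[0])
    h = np.zeros_like(features[0])
    for f in features:   # the current key-frame feature F_k is fed last
        h = cell.step(h, f)
    return h
```

Using the last GRU state as F̄_k matches the description above: earlier warped key-frame features condition the state before the current feature arrives.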
Motivated by [12, 13], given a reference frame I_j and a target frame I_i, an optical
flow field is calculated by the embedded FlowNet F [10, 14] to obtain a pixel-wise
motion path. Extending the FlowNet with a scale field that has the same spatial and
channel dimensions as the feature maps helps to improve the flow warping accuracy.
The feature propagation function is defined as F_{j→i} = S ⊙ W(F_j, F(I_i, I_j)),
where F_j denotes the deep feature of the reference frame I_j, W denotes the bilinear
sampler function, ⊙ denotes element-wise multiplication, F represents the flow estimation
function, and S is the scale field that refines the warped feature. FlowNet-S [10] is
adopted as the flow estimation function and is pretrained on the FlyingChairs dataset.
A scale map with the same dimensions as the target features is predicted in parallel
with the flow field by FlowNet via an additional 1×1 convolutional layer attached
to the top feature of the flow network. The weights of the extra 1×1 convolutional
layer are initialized with zeros. The biases are initialized with ones and frozen during
the training phase. The whole process is fully differentiable, which has been clearly
described in [12, 13].
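The warping-plus-scale refinement can be sketched in NumPy as below, mirroring the bilinear sampler W and the element-wise scale field S; here the flow and scale fields are taken as given rather than predicted by FlowNet.

```python
import numpy as np

def bilinear_warp(feat, flow):
    """Warp a (C, H, W) feature map by a (2, H, W) flow field (dx, dy)."""
    c, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    sx = np.clip(xs + flow[0], 0, w - 1)     # source x coordinates
    sy = np.clip(ys + flow[1], 0, h - 1)     # source y coordinates
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    fx, fy = sx - x0, sy - y0
    top = feat[:, y0, x0] * (1 - fx) + feat[:, y0, x1] * fx
    bot = feat[:, y1, x0] * (1 - fx) + feat[:, y1, x1] * fx
    return top * (1 - fy) + bot * fy

def propagate(feat_key, flow, scale):
    """F_t = S ⊙ W(F_k, M): warp the key-frame feature, refine with the scale field."""
    return scale * bilinear_warp(feat_key, flow)
```

Because bilinear sampling and element-wise scaling are both differentiable, gradients flow back through the warp, which is what makes the whole pipeline trainable end-to-end.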
Fig. 7.4 Our Parsing-RCNN module for instance-level human parsing. Feature maps extracted by
the backbone network are simultaneously passed through the instance-level human segmentation
branch and global human parsing branch, and the results are then integrated to obtain the final
instance-level human parsing results by taking the union of all parts assigned to a particular human
instance
Formally, during training, we define a multitask loss on both the whole image and
each ROI as

L = L_parsing + L_cls + L_box + L_mask.   (7.7)

L_parsing is the image global parsing loss, which is defined as the softmax cross-entropy
loss. Specifically, L_cls, L_box, and L_mask are calculated on each ROI. The global
parsing branch and the instance-level human segmentation branch are jointly trained
to minimize L by stochastic gradient descent (SGD).
Training. Our ATEN is fully differentiable and can be trained end-to-end. A standard
image domain method can be transferred to video tasks by selecting a proper task-
specific subnetwork. During the training phase, in each minibatch, video frames
{I_{k−p}, ..., I_k, I_t}, −l/2 ≤ t − k < l − l/2, are randomly sampled and fed to the
network. In the forward pass, N_feat is applied on all frames except I_t. After the
encoded feature F̄_k is obtained, it is propagated to F_t. Otherwise, the feature maps
are identical and passed through N_parse directly. Finally, N_parse is applied on F_t or
F̄_k. Because all the components are differentiable, the multitask loss, as illustrated
in Eq. 7.7, can backpropagate to all subnetworks to optimize task performance.
Inference. Algorithm 1 summarizes the inference algorithm. Given a video frame
sequence I , a segment length l, and an encoding range p, the proposed method
sequentially processes each segment. Only one frame is selected as the key frame in
each segment. A fully convolutional network is applied on key frame Ik to extract
feature Fk . Then, it searches the p former key frames and feeds them into the adaptive
temporal encoding module with the current key frame. When there are not enough
former key frames, the p latter key frames are selected instead. Subsequently, these
key frames are warped to the current key frame via a flow-guided propagation module
and consecutively fed to convGRU for temporal coherence feature encoding. With
the encoded feature F̄_k, the features F_t of the other non-key frames in this segment
can be obtained by the flow-guided feature propagation module. Finally, the Parsing-RCNN
module is applied on F̄_k or F_t to obtain instance-level parsing results.
Regarding runtime complexity, the ratio of our method to the single-frame baseline
is

r = [(l + p) × O(F)] / [l × O(N_feat)] + 1/l < 1.   (7.9)
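Plugging representative numbers into Eq. (7.9) makes the speedup concrete; the per-frame costs below are hypothetical stand-ins for O(F) and O(N_feat).

```python
def runtime_ratio(l, p, cost_flow, cost_feat):
    """Eq. (7.9): r = ((l + p) * O(F)) / (l * O(N_feat)) + 1/l."""
    return ((l + p) * cost_flow) / (l * cost_feat) + 1.0 / l

# Hypothetical costs: flow estimation ~10x cheaper than the backbone network.
r = runtime_ratio(l=8, p=2, cost_flow=1.0, cost_feat=10.0)  # 0.25
```

With a segment length of 8 and encoding range of 2, the method would run at roughly a quarter of the per-frame baseline cost under these assumed costs.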
In fact, the encoding range p is small (e.g., 1 or 2), and the backbone fully convolutional
network has a much higher time complexity than FlowNet, so r < 1. Our approach
thus achieves a faster speed than the per-frame baseline while maintaining high accuracy.
Part IV: Identifying and Verifying Persons
Person verification involves person reidentification and face recognition (in this
chapter, we focus on face verification in different modalities, i.e., faces from still
images and videos, older and younger faces, and sketch and photo portraits).
Person reidentification (ReID), which aims to match pedestrian images across
multiple nonoverlapping cameras, has attracted increasing attention in surveillance.
Most recent works can be categorized into three groups: (1) extracting invariant and
discriminant features [1–4], (2) learning a robust metric or subspace for matching
[1, 5–8], and (3) joint learning of the above two methods [9–11]. Recently, deep
learning [4] and video-based models [12] have also been introduced for ReID. There
are also works on the generalization of ReID, e.g., [13, 14]. Recently, GAN [15] was
also introduced to boost the performance of ReID. Zheng et al. [16] adopt DCGAN
for unlabeled data generation and effectively improve the discriminative ability of the
baseline. Zhong et al. [17] propose two camera-style adaptation methods for same-source
mapping and unsupervised domain adaptation. Deng et al. [18] introduce
similarity-preserving GAN (SPGAN) to learn image transition from the source to
the target domain in an unsupervised manner. Despite considerable efforts, ReID is
still an open problem due to the dramatic variations in viewpoint and pose changes.
Despite the great advances in face-related research in recent years, face recognition
across age remains a challenging problem. The challenges include large intrasubject
variation and great intersubject similarity [19]. The human facial appearance changes
greatly with the aging process. From birth to adulthood, the greatest change is cran-
iofacial growth, which involves a change in shape; from adulthood to old age, the
most perceptible change is skin aging, which involves a texture change [20]. Such
changes in the same person are intrasubject variations. Meanwhile, different persons
in the same age period may look similar, which is intersubject similarity. Therefore,
reducing intrasubject variations while increasing intersubject differences is a crucial
goal in metric-based age-invariant recognition. Several traditional approaches, such
as linear discriminant analysis (LDA) [21], Bayesian face recognition [22, 23], met-
ric learning [24], and recent deep learning methods [25], have realized this goal for
general face recognition.
Sketch-photo face verification is an interesting yet challenging task that aims to
verify whether a photo of a face and a drawn sketch of a face both portray the same
individual. This task has an important application in assisting law enforcement, i.e.,
using a face sketch to find candidate face photos. However, it is difficult to match
photos and sketches in two different modalities. For example, hand-drawing may
create unpredictable facial distortion and variation compared to a photo, and face
sketches often lack details that can be important cues for preserving identity. Many
attempts have been made to verify faces between sketches and photos. For example, a
local-based strategy proposed by Xiao et al. [26] was based on the embedded hidden
Markov model (E-HMM). The researchers transformed the sketches into pseudo-
photos and applied the eigenface algorithm for recognition. Zhang et al. [27] added
a refinement step to the existing approaches by applying a support vector regression
(SVR)-based model to synthesize high-frequency information. Similarly, Gao et al.
[28] proposed a new method called SNS-SRE with two steps, i.e., sparse neigh-
bor selection (SNS) to obtain an initial estimation and sparse-representation-based
enhancement (SRE) for further improvement. To capture person identity during the
photo-sketch transformation, [29] defined an optimization objective in the form of
joint generative-discriminative minimization. In particular, a discriminative regular-
ization term is incorporated into the photo-sketch generation, enhancing the discrim-
inability of the generated person sketches in relation to sketches of other individuals
and thus boosting the capacity of both photo-sketch generation and face-sketch ver-
ification.
Matching person faces across still images and videos is a newly emerging task in
intelligent visual surveillance. In these applications, still images (e.g., ID photos) are
usually captured in a controlled environment, while faces in surveillance videos are
filmed in complex scenarios (e.g., with various lighting conditions and occlusions and
in low resolutions). Several cross-domain methods have been proposed to address
the still-to-video face recognition problem [30]. However, their performances are
still poor.
8.1 Introduction
machine (DAM) [7] for multiple source domain adaptation, but this approach requires
a set of pretrained base classifiers.
Various discriminative common space approaches have been developed by uti-
lizing label information. Supervised information can be employed by the Rayleigh
quotient [2], which treats the label as the common space [8], or by employing the
max-margin rule [9]. Using the SCDL framework, structured group sparsity has
been adopted to utilize label information [5]. The generalization of discriminative
common space to multiple views has also been studied [10]. Kan et al. propose a
multiview discriminant analysis (MvDA) [11] method to obtain a common space for
multiple views by optimizing both the inter-view and intra-view Rayleigh quotients.
In [12], a method is proposed to learn shape models using local curve segments with
multiple types of distance metrics.
For most existing multiview analysis methods, the target is defined based on
the standard inner product or distance between the samples in the feature space. In
the field of metric learning, several generalized similarity/distance measures have
been studied to improve recognition performance. In [13, 14], the generalized dis-
tance/similarity measures are formulated as the difference between the distance com-
ponent and the similarity component to take into account both the cross-inner-product
term and two norm terms. Li et al. [15] adopt the second-order decision function as
a distance measure without considering the positive semidefinite (PSD) constraint.
Chang and Yeung [16] suggest an approach to learning locally smooth metrics using
local affine transformations while preserving the topological structure of the origi-
nal data. These distance/similarity measures, however, were developed for matching
samples from the same domain, and they cannot be directly applied to cross-domain
data matching.
To extend traditional single-domain metric learning, Mignon and Jurie [17] sug-
gest a cross-modal metric learning (CMML) model, which learns domain-specific
transformations based on a generalized logistic loss. Zhai et al. [18] incorporate joint
graph regularization into a heterogeneous metric learning model to improve the cross-
media retrieval accuracy. In [17, 18], Euclidean distance is adopted to measure the
dissimilarity in the latent space. Instead of explicitly learning domain-specific trans-
formations, Kang et al. [19] learn a low-rank matrix to parameterize the cross-modal
similarity measure by the accelerated proximal gradient (APG) algorithm. However,
these methods are based mainly on common similarity or distance measures, and
none of them addresses the feature learning problem in cross-domain scenarios.
Instead of using handcrafted features, learning feature representations and contextual
relations with deep neural networks, especially convolutional neural networks
(CNNs) [20], has shown great potential in various pattern recognition tasks such as
object recognition [21] and semantic segmentation [22]. Significant performance
gains have also been achieved in face recognition [23] and person reidentification
[24–27] that are mainly attributable to the progress in deep learning. Recently, several
deep CNN-based models have been explored for similarity matching and learning.
For example, Andrew et al. [28] propose a multilayer CCA model consisting of
several stacked nonlinear transformations. Li et al. [29] learn filter pairs via deep
networks to handle misalignment and photometric and geometric transformations
and achieve promising results for the person reidentification task. Wang et al. [30]
learn fine-grained image similarity with a deep ranking model. Yi et al. [31] present
a deep metric learning approach by generalizing the Siamese CNN. Ahmed et al.
[25] propose a deep convolutional architecture to measure the similarity between a
pair of pedestrian images. In addition to the shared convolutional layers, their net-
work includes a neighborhood difference layer and a patch summary layer to compute
cross-input neighborhood differences. Chen et al. [26] propose a deep ranking frame-
work to learn the joint representation of an image pair and return the similarity score
directly in which the similarity model is replaced by full connection layers.
Our deep model is partially motivated by the above works, but we target a more
powerful solution to cross-domain visual matching by incorporating a generalized
similarity function into deep neural networks. Moreover, our network architecture is
different from those presented in the existing works, leading to new state-of-the-art
results for several challenging person verification and recognition tasks.
Cross-domain visual data matching, e.g., matching persons across ID photos and
surveillance videos, is one of the fundamental problems in many real-world vision
tasks. Conventional approaches to this problem usually involve two steps: (i) pro-
jecting samples from different domains into a common space and (ii) computing
(dis)similarity in this space based on a certain distance. In this section, we present a
novel pairwise similarity measure that advances the existing models by (i) expand-
ing traditional linear projections into affine transformations and (ii) fusing affine
Mahalanobis distance and cosine similarity in a data-driven combination. More-
over, we unify our similarity measure with feature representation learning via deep
convolutional neural networks. Specifically, we incorporate the similarity measure
matrix into the deep architecture, enabling an end-to-end method of model optimiza-
tion. We extensively evaluate our generalized similarity model in several challenging
cross-domain matching tasks: person reidentification in different views and face ver-
ification in different modalities (i.e., faces from still images and videos, older and
younger faces, and sketch and photo portraits). The experimental results demonstrate
the superior performance of our model compared to other state-of-the-art methods.
(1) Samples from different modalities are first projected into a common space
by learning a transformation. The computation may be simplified by assuming that
these cross-domain samples share the same projection.
(2) A certain distance is then utilized to measure the similarity in the projection
space. Usually, Euclidean distance or inner product distance is used.
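As a concrete illustration of this two-step pipeline, the sketch below projects samples from two domains into a common space with hypothetical linear maps U and V (random here; in practice they would be learned, e.g., by CCA or PLS) and then applies Euclidean distance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical projection matrices for the two domains; all dimensions
# are illustrative only (32-d and 48-d inputs, 10-d common space).
U = rng.normal(size=(10, 32))
V = rng.normal(size=(10, 48))

def cross_domain_distance(x, y):
    """Step 1: project both samples into the common space.
    Step 2: measure dissimilarity with Euclidean distance."""
    return np.linalg.norm(U @ x - V @ y)

x = rng.normal(size=32)
y = rng.normal(size=48)
d = cross_domain_distance(x, y)
```

If the two projections map a pair onto the same point in the common space, the distance is zero, which is exactly the behavior ideal projections should produce for same-class cross-domain pairs.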
Suppose that x and y are two samples of different modalities, and U and V are
two projection matrices applied to x and y, respectively. Ux and Vy are usually
formulated as linear similarity transformations mainly for convenient optimization.
A similarity transformation has a useful property of preserving the shape of an
object that undergoes this transformation, but it is limited in capturing complex
deformations that usually exist in various real problems, e.g., translation, shearing,
and composition. On the other hand, Mahalanobis distance, cosine similarity, and
combinations of the two have been widely studied in the research on similarity metric
learning, but how to unify feature learning and similarity learning, in particular, how
to combine Mahalanobis distance with cosine similarity and integrate the distance
metric into deep neural networks for end-to-end learning, remains less investigated.
To address the above issues, in this work, we present a more general similarity
measure and unify it with deep convolutional representation learning. One of the key
innovations is that we generalize two aspects of the existing similarity models. First,
we extend the similarity transformations Ux and Vy to the affine transformations
by adding a translation vector to them, i.e., replacing Ux and Vy with LA x + a
and LB y + b, respectively. Affine transformation is a generalization of similarity
transformation without the requirement of preserving the original point in a linear
space, and it is able to capture more complex deformations. Second, in contrast to the
traditional approaches that choose either Mahalanobis distance or cosine similarity,
we combine these two measures in the affine transformation. This combination is
realized in a data-driven fashion, as discussed in the Appendix, resulting in a novel
generalized similarity measure, defined as
\[
S(\mathbf{x}, \mathbf{y}) = \begin{bmatrix} \mathbf{x}^T & \mathbf{y}^T & 1 \end{bmatrix}
\begin{bmatrix} \mathbf{A} & \mathbf{C} & \mathbf{d} \\ \mathbf{C}^T & \mathbf{B} & \mathbf{e} \\ \mathbf{d}^T & \mathbf{e}^T & f \end{bmatrix}
\begin{bmatrix} \mathbf{x} \\ \mathbf{y} \\ 1 \end{bmatrix}, \tag{8.1}
\]
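Eq. (8.1) is a quadratic form in the augmented vector [x; y; 1]. The following sketch (with illustrative dimensions and random components; A and B are built as Gram matrices so that they are PSD) assembles the block matrix and evaluates the measure:

```python
import numpy as np

rng = np.random.default_rng(1)
r = 4  # illustrative feature dimension for both domains

# Components of the generalized similarity; A and B are PSD by construction.
LA, LB = rng.normal(size=(r, r)), rng.normal(size=(r, r))
A, B = LA.T @ LA, LB.T @ LB
C = rng.normal(size=(r, r))
d_vec, e_vec = rng.normal(size=r), rng.normal(size=r)
f = rng.normal()

# Assemble the (2r+1) x (2r+1) block matrix of Eq. (8.1).
M = np.block([
    [A,              C,              d_vec[:, None]],
    [C.T,            B,              e_vec[:, None]],
    [d_vec[None, :], e_vec[None, :], np.array([[f]])],
])

def S(x, y):
    """Generalized similarity S(x, y) = [x; y; 1]^T M [x; y; 1]."""
    z = np.concatenate([x, y, [1.0]])
    return z @ M @ z

x, y = rng.normal(size=r), rng.normal(size=r)
```

Expanding the quadratic form gives S(x, y) = x^T A x + y^T B y + 2 x^T C y + 2 d^T x + 2 e^T y + f, and since A differs from B and d from e in general, the measure is asymmetric in its two arguments, which suits cross-domain matching.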
Fig. 8.1 Illustration of the generalized similarity model. Conventional approaches project data by
simply using linear similarity transformations (i.e., U and V), as illustrated in (a), where Euclidean
distance is applied as the distance metric. As illustrated in (b), we improve the existing models by
(i) expanding the traditional linear similarity transformation into an affine transformation and (ii)
fusing Mahalanobis distance and cosine similarity. The case in (a) is a simplified version of our
model. Please refer to the Appendix for the deduction details
In most existing methods, the similarity measure is performed in the original data space or in a predefined feature space; that is, the feature
extraction and the similarity measure are studied separately. These methods may have
several drawbacks in practice. For example, the similarity models rely heavily on fea-
ture engineering and thus lack generalizability when applied to problems in different
scenarios. Moreover, the interaction between the feature representations and similar-
ity measures is ignored or simplified, thus limiting their performance. Meanwhile,
deep learning, especially the convolutional neural network (CNN), has demonstrated
its effectiveness in learning discriminative features from raw data and has benefited
from building end-to-end learning frameworks. Motivated by these works, we build
a deep architecture to integrate our similarity measure into CNN-based feature repre-
sentation learning. Our architecture takes raw images from different modalities as the
inputs and automatically produces representations of these images by sequentially
stacking shared subnetworks upon domain-specific subnetworks. Upon these layers,
we further incorporate the components of our similarity measure by stimulating them
with several appended structured neural network layers. The feature learning and the
similarity model learning are thus integrated for end-to-end optimization.
According to the discussion in Sect. 8.2, our generalized similarity measure extends
the traditional linear projection and integrates Mahalanobis distance and cosine sim-
ilarity into a generic form, as shown in Eq. (8.1). As shown in the Appendix, A
and B in our similarity measure are positive semidefinite, but C does not obey this
constraint. Hence, we can further factorize A, B and C as follows:
\[
\begin{aligned}
\mathbf{A} &= \mathbf{L}_A^T \mathbf{L}_A,\\
\mathbf{B} &= \mathbf{L}_B^T \mathbf{L}_B,\\
\mathbf{C} &= -(\mathbf{L}_C^x)^T \mathbf{L}_C^y.
\end{aligned} \tag{8.2}
\]
Moreover, our model extracts feature representation (i.e., f1 (x) and f2 (y)) from
the raw input data by utilizing the CNN. Incorporating the feature representation and
the above matrix factorization into Eq. (8.1), we thus obtain the following similarity
model:
\[
\begin{aligned}
\tilde{S}(\mathbf{x}, \mathbf{y}) &= S(f_1(\mathbf{x}), f_2(\mathbf{y}))\\
&= \begin{bmatrix} f_1(\mathbf{x})^T & f_2(\mathbf{y})^T & 1 \end{bmatrix}
\begin{bmatrix} \mathbf{A} & \mathbf{C} & \mathbf{d} \\ \mathbf{C}^T & \mathbf{B} & \mathbf{e} \\ \mathbf{d}^T & \mathbf{e}^T & f \end{bmatrix}
\begin{bmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{y}) \\ 1 \end{bmatrix}\\
&= \|\mathbf{L}_A f_1(\mathbf{x})\|^2 + \|\mathbf{L}_B f_2(\mathbf{y})\|^2 - 2\left(\mathbf{L}_C^x f_1(\mathbf{x})\right)^T \left(\mathbf{L}_C^y f_2(\mathbf{y})\right) + 2\mathbf{d}^T f_1(\mathbf{x}) + 2\mathbf{e}^T f_2(\mathbf{y}) + f.
\end{aligned} \tag{8.3}
\]
Specifically, \(\mathbf{L}_A f_1(\mathbf{x})\), \(\mathbf{L}_C^x f_1(\mathbf{x})\), and \(\mathbf{d}^T f_1(\mathbf{x})\) can be regarded as the similarity
components for \(\mathbf{x}\), while \(\mathbf{L}_B f_2(\mathbf{y})\), \(\mathbf{L}_C^y f_2(\mathbf{y})\), and \(\mathbf{e}^T f_2(\mathbf{y})\) correspondingly represent
\(\mathbf{y}\). These similarity components are modeled as the weights that connect the neurons
of the last two layers. For example, a portion of the output activations represents
LA f1 (x) by taking f1 (x) as the input and multiplying the corresponding weights LA .
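The equivalence between the block-matrix form of Eq. (8.1) and the expanded component form of Eq. (8.3), under the factorization of Eq. (8.2), can be checked numerically. The sketch below uses random stand-ins for the CNN outputs f1(x) and f2(y):

```python
import numpy as np

rng = np.random.default_rng(2)
r = 4  # illustrative dimension of f1(x) and f2(y)

LA, LB = rng.normal(size=(r, r)), rng.normal(size=(r, r))
LCx, LCy = rng.normal(size=(r, r)), rng.normal(size=(r, r))
d_vec, e_vec, f = rng.normal(size=r), rng.normal(size=r), rng.normal()

# Factorized matrices (Eq. 8.2), including C = -(L_C^x)^T L_C^y.
A, B, C = LA.T @ LA, LB.T @ LB, -LCx.T @ LCy

# Random stand-ins for the feature representations f1(x) and f2(y).
f1x, f2y = rng.normal(size=r), rng.normal(size=r)

# Block-matrix form of Eq. (8.1) applied to the features.
z = np.concatenate([f1x, f2y, [1.0]])
M = np.block([
    [A,              C,              d_vec[:, None]],
    [C.T,            B,              e_vec[:, None]],
    [d_vec[None, :], e_vec[None, :], np.array([[f]])],
])
S_block = z @ M @ z

# Expanded component form of Eq. (8.3).
S_expanded = (np.sum((LA @ f1x) ** 2) + np.sum((LB @ f2y) ** 2)
              - 2 * (LCx @ f1x) @ (LCy @ f2y)
              + 2 * d_vec @ f1x + 2 * e_vec @ f2y + f)
```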
Below, we discuss the formulation of our similarity learning.
The objective of our similarity learning is to seek a function S̃(x, y) that satisfies
a set of similarity/dissimilarity constraints. Instead of learning a similarity func-
tion in a handcrafted feature space, we take the raw data as input and introduce a
deep similarity learning framework to integrate nonlinear feature learning and generalized
similarity learning. Recall that our deep generalized similarity model is
shown in Eq. (8.1). \(f_1(\mathbf{x})\) and \(f_2(\mathbf{y})\) are the feature representations for samples
from different modalities, and we use \(\mathbf{W}\) to indicate their parameters. We denote
\(\Phi = (\mathbf{L}_A, \mathbf{L}_B, \mathbf{L}_C^x, \mathbf{L}_C^y, \mathbf{d}, \mathbf{e}, f)\) as the similarity components for sample matching.
Note that \(\tilde{S}(\mathbf{x}, \mathbf{y})\) is asymmetric, i.e., \(\tilde{S}(\mathbf{x}, \mathbf{y}) \neq \tilde{S}(\mathbf{y}, \mathbf{x})\). This is reasonable for cross-domain
matching because the similarity components are domain-specific.
Assume that \(\mathcal{D} = \{(\{\mathbf{x}_i, \mathbf{y}_i\}, \ell_i)\}_{i=1}^{N}\) is a training set of cross-domain sample pairs,
where \(\{\mathbf{x}_i, \mathbf{y}_i\}\) denotes the \(i\)th pair, and \(\ell_i\) denotes the corresponding label of \(\{\mathbf{x}_i, \mathbf{y}_i\}\)
indicating whether \(\mathbf{x}_i\) and \(\mathbf{y}_i\) are from the same class:
\[
\ell_i = \ell(\mathbf{x}_i, \mathbf{y}_i) = \begin{cases} -1, & c(\mathbf{x}) = c(\mathbf{y}) \\ 1, & \text{otherwise,} \end{cases} \tag{8.4}
\]
where c(x) denotes the class label of the sample x. An ideal deep similarity model
is expected to satisfy the following constraints:
\[
\tilde{S}(\mathbf{x}_i, \mathbf{y}_i) \begin{cases} < -1, & \text{if } \ell_i = -1 \\ \ge 1, & \text{otherwise.} \end{cases} \tag{8.5}
\]
To improve the stability of the solution, some regularizers are also introduced, result-
ing in our deep similarity learning model:
\[
(\hat{\mathbf{W}}, \hat{\Phi}) = \arg\min_{\mathbf{W}, \Phi} \sum_{i=1}^{N} \left(1 - \ell_i \tilde{S}(\mathbf{x}_i, \mathbf{y}_i)\right)_+ + \Psi(\mathbf{W}, \Phi), \tag{8.7}
\]

where \((\cdot)_+ = \max(0, \cdot)\) denotes the hinge function and \(\Psi(\mathbf{W}, \Phi)\) is a regularization term.
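The data term of this objective is a hinge-like loss over sample pairs; a minimal sketch (with illustrative scores and labels, following the label convention of Eq. (8.4)) is:

```python
import numpy as np

def hinge_similarity_loss(scores, labels):
    """Sum of (1 - l_i * S_i)_+ over sample pairs.
    labels: -1 for same-class pairs, +1 otherwise."""
    return np.maximum(0.0, 1.0 - labels * scores).sum()

# Illustrative pairs: a same-class pair with S < -1 and a cross-class pair
# with S >= 1 incur no loss; the last pair violates its constraint.
scores = np.array([-2.0, 1.5, 0.2])
labels = np.array([-1.0, 1.0, -1.0])
loss = hinge_similarity_loss(scores, labels)  # only the third pair contributes
```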
One can observe that the LADF decision function F(x, y) equals S(x, y) when we set B = A and e = d in our model.
It should be noted that LADF treats x and y using the same metrics, i.e., A for both
xT Ax and yT Ay, and d for dT x and dT y. Such a model is reasonable for matching
samples with the same modality but may be unsuitable for cross-domain matching
where x and y are from different modalities. Compared with LADF, our model uses
A and d to calculate xT Ax and dT x and uses B and e to calculate yT By and eT y,
making our model more effective for cross-domain matching.
In [13], Chen et al. extend the classical Bayesian face model by learning a joint distribution
(i.e., intraperson and extraperson variations) of sample pairs. Their decision
function is expressed in the following form:

\[
F(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{A} \mathbf{x} + \mathbf{y}^T \mathbf{A} \mathbf{y} - 2\mathbf{x}^T \mathbf{G} \mathbf{y}.
\]
Note that the similarity metric model proposed in [14] adopts a similar form. Inter-
estingly, this decision function is also a special variant of our model if we set B = A,
C = −G, d = 0, e = 0, and f = 0.
In summary, our similarity model can be regarded as a generalization of many
existing cross-domain matching and metric learning models; therefore, it is more
flexible and suitable than those models for cross-domain visual data matching.
In this section, we introduce our deep architecture that integrates the generalized
similarity measure with convolutional feature representation learning.
As discussed above, our model defined in Eq. (8.7) jointly addresses similarity func-
tion learning and feature learning. This integration is achieved by building a deep
architecture of convolutional neural networks, which is illustrated in Fig. 8.2. It is
worth mentioning that our architecture is able to handle input samples from different
modalities with unequal numbers, e.g., 20 samples of x and 200 samples of y are fed
into the network as a batch process.
From left to right in Fig. 8.2, two domain-specific subnetworks, g1 (x) and g2 (y),
are applied to samples from two different modalities. Then, the outputs of g1 (x) and
g2 (y) are concatenated into a shared subnetwork f(·). We superpose g1 (x) and g2 (y)
to feed f(·). At the output of f(·), the feature representations of the two samples
are extracted separately as f1 (x) and f2 (y), as indicated by the slice operator in
Fig. 8.2. Finally, these learned feature representations are utilized in the structured
fully connected layers that incorporate the similarity components defined in Eq. (8.3).
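The forward data flow just described (two domain-specific subnetworks, a shared subnetwork fed with the superposed batches, and a slice operation) can be sketched with toy fully connected layers; the real model uses convolutional subnetworks, and all sizes and weights here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(t):
    return np.maximum(t, 0.0)

# Toy weights standing in for the subnetworks.
W_g1 = rng.normal(size=(16, 32)) * 0.1  # domain-specific subnetwork g1 for x
W_g2 = rng.normal(size=(16, 48)) * 0.1  # domain-specific subnetwork g2 for y
W_f = rng.normal(size=(8, 16)) * 0.1    # shared subnetwork f

def forward(xs, ys):
    """xs: batch of domain-X samples, ys: batch of domain-Y samples.
    The two batches may have unequal sizes."""
    gx, gy = relu(xs @ W_g1.T), relu(ys @ W_g2.T)
    # Superpose both domains and feed the shared subnetwork ...
    shared = relu(np.vstack([gx, gy]) @ W_f.T)
    # ... then slice the output back into f1(x) and f2(y).
    return shared[:len(xs)], shared[len(xs):]

f1x, f2y = forward(rng.normal(size=(20, 32)), rng.normal(size=(200, 48)))
```

Note the unequal batch sizes (20 versus 200), mirroring the batch-processing property mentioned above.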
Below, we introduce the detailed setting of the three subnetworks.
Fig. 8.2 Deep architecture of our similarity model. This architecture comprises three parts: a
domain-specific subnetwork, a shared subnetwork and a similarity subnetwork. The first two parts
extract feature representations from samples from different domains, which are built upon a number
of convolutional layers, max-pooling operations, and fully connected layers. The similarity subnet-
work contains two structured fully connected layers that incorporate the similarity components in
Eq. (8.3)
In this section, we discuss the learning method for our similarity model training. To
avoid loading all images into memory, we use the minibatch learning approach; that
is, in each training iteration, a subset of the image pairs is fed into the neural network
for model optimization.
For notation simplicity in discussing the learning algorithm, we start by introduc-
ing the following definitions:
\[
\begin{aligned}
\tilde{\mathbf{x}} &= [\, \mathbf{L}_A f_1(\mathbf{x}) \;\; \mathbf{L}_C^x f_1(\mathbf{x}) \;\; \mathbf{d}^T f_1(\mathbf{x}) \,]^T,\\
\tilde{\mathbf{y}} &= [\, \mathbf{L}_B f_2(\mathbf{y}) \;\; \mathbf{L}_C^y f_2(\mathbf{y}) \;\; \mathbf{e}^T f_2(\mathbf{y}) \,]^T,
\end{aligned} \tag{8.10}
\]
where \(\tilde{\mathbf{x}}\) and \(\tilde{\mathbf{y}}\) denote the output layer activations for samples \(\mathbf{x}\) and \(\mathbf{y}\). Prior to incorporating
Eq. (8.10) into the similarity model in Eq. (8.3), we introduce three transformation
matrices (in MATLAB-style notation):
\[
\begin{aligned}
P_1 &= \begin{bmatrix} \mathbf{I}_{r \times r} & \mathbf{0}_{r \times (r+1)} \end{bmatrix},\\
P_2 &= \begin{bmatrix} \mathbf{0}_{r \times r} & \mathbf{I}_{r \times r} & \mathbf{0}_{r \times 1} \end{bmatrix},\\
\mathbf{p}_3 &= \begin{bmatrix} \mathbf{0}_{1 \times 2r} & 1 \end{bmatrix}^T,
\end{aligned} \tag{8.11}
\]
where \(r\) equals the dimension of the output of the shared neural network (i.e., the
dimensions of \(f_1(\mathbf{x})\) and \(f_2(\mathbf{y})\)), and \(\mathbf{I}\) indicates the identity matrix. Then, our similarity
model can be rewritten as

\[
\tilde{S}(\mathbf{x}, \mathbf{y}) = (P_1\tilde{\mathbf{x}})^T P_1\tilde{\mathbf{x}} + (P_1\tilde{\mathbf{y}})^T P_1\tilde{\mathbf{y}} - 2(P_2\tilde{\mathbf{x}})^T P_2\tilde{\mathbf{y}} + 2\mathbf{p}_3^T\tilde{\mathbf{x}} + 2\mathbf{p}_3^T\tilde{\mathbf{y}} + f. \tag{8.12}
\]
By incorporating Eq. (8.12) into the loss function in Eq. (8.6), we obtain the following
objective:

\[
G(\mathbf{W}, \Phi; \mathcal{D}) = \sum_{i=1}^{N} \Big\{ 1 - \ell_i \big[ (P_1\tilde{\mathbf{x}}_i)^T P_1\tilde{\mathbf{x}}_i + (P_1\tilde{\mathbf{y}}_i)^T P_1\tilde{\mathbf{y}}_i - 2(P_2\tilde{\mathbf{x}}_i)^T P_2\tilde{\mathbf{y}}_i + 2\mathbf{p}_3^T\tilde{\mathbf{x}}_i + 2\mathbf{p}_3^T\tilde{\mathbf{y}}_i + f \big] \Big\}_+, \tag{8.13}
\]
where the summation term denotes the hinge-like loss for the cross-domain sample
pair \(\{\tilde{\mathbf{x}}_i, \tilde{\mathbf{y}}_i\}\), \(N\) is the total number of pairs, \(\mathbf{W}\) represents the feature representations of
the different domains, and \(\Phi\) represents the similarity model. \(\mathbf{W}\) and \(\Phi\) are both embedded
as weights connecting neurons of layers in our deep neural network model, as Fig. 8.2
illustrates.
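The selection matrices in Eq. (8.11) simply pick out the three blocks of the (2r+1)-dimensional output activation; this can be verified directly (r and the activation values below are illustrative):

```python
import numpy as np

r = 3  # illustrative output dimension of the shared subnetwork

# Selection matrices of Eq. (8.11).
P1 = np.hstack([np.eye(r), np.zeros((r, r + 1))])                # first r entries
P2 = np.hstack([np.zeros((r, r)), np.eye(r), np.zeros((r, 1))])  # middle r entries
p3 = np.concatenate([np.zeros(2 * r), [1.0]])                    # last (scalar) entry

rng = np.random.default_rng(4)
a, c, s = rng.normal(size=r), rng.normal(size=r), rng.normal()
# x_tilde = [L_A f1(x); L_C^x f1(x); d^T f1(x)], as in Eq. (8.10).
x_tilde = np.concatenate([a, c, [s]])
```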
The objective function in Eq. (8.13) is defined in a sample-pair-based form. To
optimize it using SGD, a certain scheme should be applied to generate minibatches
of the sample pairs, which is usually associated with high computation
and memory costs. Note that the sample pairs in the training set \(\mathcal{D}\) are constructed
from the original set of samples from the two modalities, \(\mathcal{Z} = \{\{\mathbf{X}\}, \{\mathbf{Y}\}\}\), where
\(\mathbf{X} = \{\mathbf{x}^1, ..., \mathbf{x}^j, ..., \mathbf{x}^{M_x}\}\) and \(\mathbf{Y} = \{\mathbf{y}^1, ..., \mathbf{y}^j, ..., \mathbf{y}^{M_y}\}\). The superscript denotes the
sample index in the original training set, e.g., \(\mathbf{x}^j \in \mathbf{X}\) and \(\mathbf{y}^j \in \mathbf{Y}\), while the subscript
denotes the index of the sample pairs, e.g., \(\mathbf{x}_i \in \{\mathbf{x}_i, \mathbf{y}_i\} \in \mathcal{D}\). \(M_x\) and \(M_y\) denote the
total numbers of samples from the two domains. Without loss of generality, we define
\(\mathbf{z}^j = \mathbf{x}^j\) and \(\mathbf{z}^{M_x + j} = \mathbf{y}^j\). For each pair \(\{\mathbf{x}_i, \mathbf{y}_i\}\) in \(\mathcal{D}\), we have \(\mathbf{z}^{j_{i,1}} = \mathbf{x}_i\) and \(\mathbf{z}^{j_{i,2}} = \mathbf{y}_i\)
with \(1 \le j_{i,1} \le M_x\) and \(M_x + 1 \le j_{i,2} \le M_z\ (= M_x + M_y)\). We also have
\(\tilde{\mathbf{z}}^{j_{i,1}} = \tilde{\mathbf{x}}_i\) and \(\tilde{\mathbf{z}}^{j_{i,2}} = \tilde{\mathbf{y}}_i\).
Therefore, we rewrite Eq. (8.13) in a sample-based form:

\[
L(\mathbf{W}, \Phi; \mathcal{Z}) = \sum_{i=1}^{N} \Big\{ 1 - \ell_i \big[ (P_1\tilde{\mathbf{z}}^{j_{i,1}})^T P_1\tilde{\mathbf{z}}^{j_{i,1}} + (P_1\tilde{\mathbf{z}}^{j_{i,2}})^T P_1\tilde{\mathbf{z}}^{j_{i,2}} - 2(P_2\tilde{\mathbf{z}}^{j_{i,1}})^T P_2\tilde{\mathbf{z}}^{j_{i,2}} + 2\mathbf{p}_3^T\tilde{\mathbf{z}}^{j_{i,1}} + 2\mathbf{p}_3^T\tilde{\mathbf{z}}^{j_{i,2}} + f \big] \Big\}_+. \tag{8.14}
\]
Given \(\Omega = (\mathbf{W}, \Phi)\), the loss function in Eq. (8.7) can also be rewritten in the sample-based
form:

\[
H(\Omega) = L(\Omega; \mathcal{Z}) + \Psi(\Omega). \tag{8.15}
\]

The parameters can then be updated by gradient descent:

\[
\Delta\Omega = -\alpha \frac{\partial}{\partial \Omega} H(\Omega), \tag{8.16}
\]

where \(\alpha\) denotes the learning rate. The key problem in solving the above equation is
calculating \(\frac{\partial}{\partial \Omega} L(\Omega)\). As discussed in [32], there are two ways to achieve this solution,
i.e., pair-based gradient descent and sample-based gradient descent. Here, we adopt
the latter to reduce the computation and memory costs.
Consider a minibatch of training samples \(\{\mathbf{z}^{j_{1,x}}, ..., \mathbf{z}^{j_{n_x,x}}, \mathbf{z}^{j_{1,y}}, ..., \mathbf{z}^{j_{n_y,y}}\}\) drawn from the
original set \(\mathcal{Z}\), where \(1 \le j_{i,x} \le M_x\) and \(M_x + 1 \le j_{i,y} \le M_z\). Following the chain
rule, calculating the gradient over all pairs of samples is equivalent to summing up the
gradients for the individual samples:
\[
\frac{\partial}{\partial \Omega} L(\Omega) = \sum_{j} \frac{\partial L}{\partial \tilde{\mathbf{z}}^j} \frac{\partial \tilde{\mathbf{z}}^j}{\partial \Omega}. \tag{8.17}
\]

The calculation of \(\partial L / \partial \tilde{\mathbf{z}}^{j_{i,y}}\) can be conducted similarly. The algorithm for calculating the
partial derivative of the output layer activation for each sample is shown in Algorithm
8.1.
Algorithm 8.1: Calculate the derivative of the output layer activation for each
sample
Input:
The output layer activation for all samples
Output:
The partial derivatives of the output layer activation for all the samples
1: for each sample z^j do
2:   Initialize the partner set M_j for the sample z^j as M_j = ∅;
3:   for each pair {x_i, y_i} do
4:     if the pair {x_i, y_i} contains the sample z^j then
5:       if the pair {x_i, y_i} satisfies ℓ_i S̃(x_i, y_i) < 1 then
6:         M_j ← {M_j, the corresponding partner of z^j in {x_i, y_i}};
7:       end if
8:     end if
9:   end for
10:  Compute the derivatives for the sample z^j with all the partners in M_j, and sum these
     derivatives to obtain the desired partial derivative of sample z^j's output layer activation
     using Eq. (8.18);
11: end for
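The partner-set bookkeeping of Algorithm 8.1 can be sketched in a few lines of Python; the derivative computation itself (Eq. 8.18) is abstracted away, and all names here are hypothetical:

```python
def collect_partners(pairs, labels, scores):
    """For each sample index, collect the partners from all pairs that
    violate the margin (l_i * S_i < 1), as in Algorithm 8.1.
    pairs:  list of (x_index, y_index) tuples
    labels: l_i in {-1, +1} for each pair
    scores: similarity score S(x_i, y_i) for each pair
    Returns a dict mapping sample index -> list of partner indices."""
    partners = {}
    for (jx, jy), l, s in zip(pairs, labels, scores):
        if l * s < 1:  # the pair contributes to the hinge loss
            partners.setdefault(jx, []).append(jy)
            partners.setdefault(jy, []).append(jx)
    return partners

# Toy example: the first pair violates the margin, the second does not.
M = collect_partners(pairs=[(0, 5), (1, 6)], labels=[-1, 1], scores=[-0.5, 2.0])
```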
Note that all three subnetworks in our deep architecture are differentiable. We
can easily use the backpropagation procedure [20] to compute the partial derivatives
with respect to the hidden layers and the model parameters Ω. We summarize the overall
procedure of deep generalized similarity measure learning in Algorithm 8.2.
If all possible pairs are used in training, the sample-based form allows us to
generate \(n_x \times n_y\) sample pairs from a minibatch of \(n_x + n_y\) samples. In contrast,
the sample-pair-based form may require up to \(2 n_x n_y\) samples to generate \(n_x \times n_y\)
sample pairs. In gradient computation, from Eq. (8.18), for each sample, we
only require calculating \(P_1^T P_1 \tilde{\mathbf{z}}^{j_{i,x}}\) once and \(P_2^T P_2 \tilde{\mathbf{z}}^{j_{i,y}}\) \(n_y\) times in the sample-based
form. In the sample-pair-based form, \(P_1^T P_1 \tilde{\mathbf{z}}^{j_{i,x}}\) and \(P_2^T P_2 \tilde{\mathbf{z}}^{j_{i,y}}\) should be computed \(n_x\)
and \(n_y\) times, respectively. In sum, the sample-based form generally incurs lower
computation and memory costs.
8.4 Experiments
Fig. 8.3 CMC curves on a the CUHK03 [29] dataset and b the CUHK01 [33] dataset for evaluating
person reidentification. Our method has superior performance compared to existing state-of-the-art
methods
The results are reported in Fig. 8.3a. It is encouraging that our approach signifi-
cantly outperforms the competing methods (e.g., improving the state-of-the-art rank-
1 accuracy from 54.74% (IDLA [25]) to 58.39%). Among the competing methods,
ITML [4], LDM [34], LMNN [35], RANK [36], KML [24], SDALF [37], KISSME
[38], and eSDC [39] are all based on handcrafted features. The superiority of our
approach in comparison to these methods should be attributed to the deployment of
both deep CNN features and the generalized similarity model. DRSCH [40], DFPNN
[29], and IDLA [25] adopt CNNs for feature representation, but their matching met-
rics are defined based on traditional linear transformations.
Results on CUHK01. Figure 8.3b shows the results of our method and of the
competing approaches on CUHK01. In addition to the methods used on CUHK03, an
additional method, i.e., LMLF [27], is used in the comparison experiment. LMLF [27]
learns midlevel filters from automatically discovered patch clusters. According to the
quantitative results, our method achieves a new state-of-the-art level of performance
with a rank-1 accuracy of 66.50%.
References
1. L. Lin, G. Wang, W. Zuo, X. Feng, L. Zhang, Cross-domain visual matching via generalized
similarity measure and feature learning, in IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 39, no. 6, pp. 1089–1102, 1 June 2017
2. D. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: an overview with
application to learning methods. Neural Comput. 16(12), 2639–2664 (2004)
3. A. Sharma, D.W. Jacobs, Bypassing synthesis: Pls for face recognition with pose, low-
resolution and sketch. Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 593–600 (2011)
26. S.-Z. Chen, C.-C. Guo, J.-H. Lai, Deep ranking for person re-identification via joint represen-
tation learning. arXiv preprint arXiv:1505.06821 (2015)
27. R. Zhao, W. Ouyang, X. Wang, Learning mid-level filters for person re-identification, in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 144–151
(2014)
28. G. Andrew, R. Arora, J. Bilmes, K. Livescu, Deep canonical correlation analysis, in Proceedings
of the 30th International Conference on Machine Learning, pp. 1247–1255 (2013)
29. W. Li, R. Zhao, T. Xiao, X. Wang, Deepreid: deep filter pairing neural network for person
re-identification, in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 152–159 (2014)
30. J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, Y. Wu, Learning fine-
grained image similarity with deep ranking, in Proceedings IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1386–1393 (2014)
31. D. Yi, Z. Lei, S.Z. Li, Deep metric learning for practical person re-identification. arXiv
preprint arXiv:1407.4979 (2014)
32. S. Ding, L. Lin, G. Wang, H. Chao, Deep feature learning with relative distance comparison
for person re-identification. Pattern Recogn. 48(10), 2993–3003 (2015)
33. W. Li, R. Zhao, X. Wang, Human reidentification with transferred metric learning. In Proceed-
ings Asian Conference on Computer Vision, pp. 31–44 (2012)
34. M. Guillaumin, J. Verbeek, C. Schmid, Is that you? metric learning approaches for face identifi-
cation, in ICCV 2009-International Conference on Computer Vision (IEEE, 2009), pp. 498–505
35. K.Q. Weinberger, J. Blitzer, L.K. Saul, Distance metric learning for large margin nearest neigh-
bor classification, in Advances in Neural Information Processing Systems, pp. 1473–1480
(2005)
36. B. McFee, G.R. Lanckriet, Metric learning to rank, in Proceedings of the International
Conference on Machine Learning, pp. 775–782 (2010)
37. M. Farenzena, L. Bazzani, A. Perina, V. Murino, M. Cristani, Person re-identification by
symmetry-driven accumulation of local features, in Computer Vision and Pattern Recogni-
tion (CVPR), 2010 IEEE Conference on (IEEE, 2010), pp. 2360–2367
38. M. Kostinger, M. Hirzer, P. Wohlhart, P.M. Roth, H. Bischof, Large scale metric learning from
equivalence constraints, in Proceedings IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2288–2295 (2012)
39. R. Zhao, W. Ouyang, X. Wang, Unsupervised salience learning for person re-identification, in
CVPR (2013)
40. R. Zhang, L. Lin, R. Zhang, W. Zuo, L. Zhang, Bit-scalable deep hashing with regularized
similarity learning for image retrieval and person re-identification. IEEE Trans. Image Process.
24(12), 4766–4779 (2015)
Chapter 9
Face Verification
Abstract This chapter introduces a novel cost-effective framework for face identifi-
cation that progressively maintains a batch of classifiers with an increasing number of
facial images of different individuals. By naturally combining two recently emerging
techniques, active learning (AL) and self-paced learning (SPL), the proposed frame-
work is capable of automatically annotating new instances and incorporating them
into the training with weak expert recertification. The advantages of this proposed
framework are twofold: (i) the required number of annotated samples is significantly
decreased, while comparable performance is guaranteed, and user effort is also dra-
matically reduced compared to other state-of-the-art active learning techniques, and
(ii) the mixture of SPL and AL effectively improves not only the classifier accuracy
but also the robustness against noisy data compared to the existing AL/SPL methods
(© 2019 IEEE. Reprinted, with permission, from [1].)
9.1 Introduction
With the proliferation of mobile phones, cameras, and social networks, a large number of
photographs have rapidly been created, especially those containing people's faces. To
interact with these photos, there have been increasing demands for intelligent sys-
tems (e.g., content-based personal photo search-and-share applications using mobile
albums or social networks) with face recognition techniques [2–4]. Owing to several
recently proposed pose/expression normalization and alignment-free approaches [5–
7], identifying faces in the wild has achieved remarkable progress. Regarding commercial
products, the website "Face.com" has provided an application programming interface
(API) to automatically detect and recognize faces in photos. The main problem in
such scenarios is identifying individuals from images in a relatively unconstrained
environment. Traditional methods usually address this problem by supervised learn-
ing [8], and it is typically expensive and time consuming to prepare a good set of
labeled samples. Because only a small amount of data is labeled, semisupervised learning [9]
may be a good candidate for solving this problem. However, [10] notes that due
to large numbers of noisy samples and outliers, directly using unlabeled data may
significantly reduce the learning performance.
better to practical variations and converge faster, especially in the early learning stage
of training.
(II) Earn more: The mixture of self-paced learning and active learning effectively
improves not only the classifier accuracy but also the classifier robustness against
noisy samples. From the perspective of AL, extra-high-confidence samples are auto-
matically incorporated into the retraining in each iteration without human labor
costs, thus gaining faster convergence. Introducing these high-confidence samples
also contributes to suppressing noisy samples in learning due to their compactness
and consistency in the feature space. From the SPL perspective, allowing active user
intervention generates reliable and diverse samples that can avoid learning being
misled by outliers. In addition, utilizing the CNN facilitates the pursuit of higher
classification performance by learning convolutional filters instead of handcrafted
feature engineering.
In brief, our ASPL framework includes two main phases. In the initial stage,
we first learn a general face representation using a convolutional neural network
architecture and train a batch of classifiers with a very small set of annotated samples
of different individuals. In the iterative learning stage, we rank the unlabeled samples
according to how they relate to the current classifiers and retrain the classifiers by
selecting and annotating samples in either an active user query or a self-paced manner.
We can also fine-tune the CNN based on the updated classifiers.
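One selection step of this iterative stage can be sketched as follows; the confidence scores, thresholds, and helper names are hypothetical, and the real framework ranks the unlabeled samples against the current classifiers:

```python
import numpy as np

def aspl_selection(scores, labels, high_thr=0.9, low_thr=0.6):
    """scores: per-sample classifier confidence in [0, 1]
    labels: known labels, with None marking unlabeled samples
    Returns indices to pseudo-label (self-paced) and to query (active)."""
    unlabeled = [i for i, l in enumerate(labels) if l is None]
    # Self-paced: high-confidence samples are pseudo-labeled automatically.
    pseudo = [i for i in unlabeled if scores[i] >= high_thr]
    # Active learning: low-confidence samples are sent to the annotator.
    query = [i for i in unlabeled if scores[i] < low_thr]
    return pseudo, query

pseudo, query = aspl_selection(
    scores=np.array([0.95, 0.50, 0.70, 0.99]),
    labels=[None, None, None, 1],
)
```

Samples falling between the two thresholds are left for later iterations; after retraining on the newly labeled data, the CNN can be fine-tuned and the scores recomputed.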
In this section, we first present a review of incremental face recognition and then
briefly introduce related developments in active learning and self-paced learning.
Incremental Face Recognition. There are two categories of methods address-
ing the problem of identifying faces with incremental data, namely, incremental
subspace and incremental classifier methods. The first category mainly includes
incremental versions of traditional subspace learning approaches, such as principal
component analysis (PCA) [23] and linear discriminant analysis (LDA) [12]. These
approaches map facial features into a subspace and keep the eigen representations
(i.e., eigen faces) up-to-date by incrementally incorporating new samples. In addi-
tion, face recognition is commonly accomplished by nearest neighbor-based feature
matching, which is computationally expensive when a large number of samples are
accumulated over time. On the other hand, the incremental classifier methods tar-
get updating the prediction boundary with the learned model parameters and new
samples. Exemplars include incremental support vector machines (ISVM) [24] and
online sequential forward neural networks [25]. In addition, several attempts have
been made to absorb advantages from both categories of methods. For example,
Ozawa et al. [26] proposed integrating incremental PCA with the resource alloca-
tion network in an iterative way. Although these approaches have made remarkable
progress, they suffer from low accuracy compared with batch-based state-of-the-art
face recognizers, and none of these approaches have been successfully validated on
large-scale datasets (e.g., more than 500 individuals). These approaches are basically
studied in the context of fully supervised learning; i.e., both initial and incremental
data have to be labeled.
Active Learning. This branch of research focuses mainly on actively selecting
and annotating the most informative unlabeled samples to avoid unnecessary and
redundant annotation. The key part of active learning is thus the selection strategy,
i.e., which samples should be presented to the user for annotation. One of the most
common strategies is certainty-based selection [27, 28], in which the certainties are
measured according to the predictions on new unlabeled samples obtained from the
initial classifiers. For example, Lewis et al. [27] propose taking the most uncertain
instance as the one that has the largest entropy on the conditional distribution over
its predicted labels. Several SVM-based methods [28] determine uncertain sam-
ples as those that are relatively close to the decision boundary. Sample certainty is
also measured by applying a committee of classifiers in [29]. These certainty-based
approaches usually ignore the large set of unlabeled instances and are thus sensitive
to outliers. A number of later methods present the information density measure by
exploiting the unlabeled data information when selecting samples. For example, the
informative samples are sequentially selected to minimize the generalization error
of the trained classifier on the unlabeled data based on a statistical approach [30]
or prior information [31]. In [32, 33], instances are taken to maximize the increase
of mutual information between the candidate instances and the remaining instances
based on Gaussian process models. The diversity of the selected instance over the
unlabeled data has also been taken into consideration [34]. Recently, Elhamifar
et al. [13] present a general framework via convex programming that considers both
the uncertainty and diversity measures for sample selection. However, these active
learning approaches usually emphasize low-confidence samples (e.g., uncertain or
diverse samples) while ignoring the other majority of high-confidence samples. To
enhance the discriminative capability, Wang et al. [9] propose a unified semisuper-
vised learning framework, which incorporates the high-confidence coding vectors
of unlabeled data into the training under the proposed effective iterative algorithm
and demonstrates its effectiveness in dictionary-based classification. Our work is
inspired by this study and also employs high-confidence samples to improve both
the accuracy and the robustness of classifiers.
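The certainty measures surveyed above can be made concrete. Below is a minimal sketch of two of them, entropy-based uncertainty [27] and margin-based (distance-to-boundary) uncertainty [28]; the probability matrix and the helper names are invented for illustration and are not from any specific system discussed here.

```python
# Two classic certainty-based selection criteria, sketched for illustration.
import numpy as np

def entropy_uncertainty(probs):
    """Entropy of each sample's predicted class distribution (higher = less certain)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def margin_uncertainty(decision_values):
    """Negative distance to the decision boundary (smaller |f(x)| = less certain)."""
    return -np.abs(decision_values)

# Invented predictions for three unlabeled samples over two classes.
probs = np.array([[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]])
ranked = np.argsort(-entropy_uncertainty(probs))  # most uncertain first
print(ranked)  # sample 0 (uniform prediction) is ranked first
```

A committee-based variant [29] would replace the single prediction matrix with averaged predictions from several classifiers.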
Self-paced Learning. Inspired by the cognitive principle of humans/animals, Bengio et al. [18] introduce the concept of curriculum learning (CL), in which a model is learned by gradually including samples in training in a sequence from easy to complex. To make this concept more implementable, Kumar et al. [19] concretely implement this learning philosophy by formulating the CL principle as a concise optimization model named self-paced learning (SPL). The SPL model includes a weighted loss term on all samples and a general SPL regularizer imposed on the sample weights. By sequentially optimizing the model with a gradually increasing pace parameter on the SPL regularizer, more samples can be automatically discovered in a purely self-paced way. Jiang et al. [15, 16] provide a more comprehensive understanding of the learning insight underlying SPL/CL and formulate the learning model as a general optimization problem as follows:
$$\min_{\mathbf{w},\,\mathbf{v}\in[0,1]^n}\ \sum_{i=1}^{n} v_i L(\mathbf{w}; x_i, y_i) + f(\mathbf{v}; \lambda), \quad \text{s.t.}\ \mathbf{v}\in\Psi, \tag{9.1}$$
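To make Eq. (9.1) concrete, the sketch below runs the standard SPL alternating scheme with the classical hard regularizer $f(\mathbf{v}; \lambda) = -\lambda \sum_i v_i$ (one admissible choice, not necessarily the regularizer used later in this chapter) on a toy 1-D mean-estimation problem; the data, pace settings, and function name are invented.

```python
# Minimal SPL alternation: with w fixed, the hard regularizer gives the
# closed-form weights v_i = 1 if L_i < lambda else 0; with v fixed, the
# "model" (here a weighted mean) is refit on the selected easy samples.
import numpy as np

def spl_fit(x, lam0=5.0, growth=1.5, iters=5):
    w = x.mean()                           # initial model: plain mean
    lam = lam0
    for _ in range(iters):
        losses = (x - w) ** 2              # per-sample loss L(w; x_i)
        v = (losses < lam).astype(float)   # closed-form weight update
        if v.sum() > 0:
            w = (v * x).sum() / v.sum()    # refit on currently "easy" samples
        lam *= growth                      # increase the pace parameter
    return w

x = np.array([0.1, 0.0, -0.1, 8.0])        # last point is an outlier
print(round(spl_fit(x), 3))                # 0.0: the outlier never enters training
```

The gradually growing pace parameter is what admits "more samples" into training over time, as the text describes.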
9.3 Framework Overview

In this section, we illustrate how our ASPL model works. As illustrated in Fig. 9.1, the main stages of our framework pipeline are CNN pretraining for face representation, classifier updating, self-paced high-confidence sample pseudo-labeling, low-confidence sample annotation by active users, and CNN fine-tuning.
CNN pretraining: Before running the ASPL framework, we need to pretrain a CNN for feature extraction on a given face dataset. The extra images used for pretraining are selected without any overlap with our experimental data. Because several publicly available CNN architectures [40, 41] have achieved remarkable success in visual recognition,
our framework supports directly employing these architectures and their pretrained models as initialized parameters. In our experiments, AlexNet [40] is utilized. Given the selection of extra annotated samples, we further fine-tune the CNN to learn discriminative feature representations.

Fig. 9.1 Illustration of our proposed cost-effective framework. The pipeline includes stages of CNN and model initialization; classifier updating; high-confidence sample labeling by the SPL and low-confidence sample annotation by AL; and CNN fine-tuning, where the arrows represent the workflow. The images highlighted in blue in the left panel represent the initially selected samples
Initialization: At the beginning, we randomly select a few images for each indi-
vidual, extract feature representation for them by the pretrained CNN, and manually
annotate labels for them as the starting point.
Classifier updating: In our ASPL framework, we use one-versus-all linear SVM
as our classifier updating strategy. In the beginning, only a small portion of the
samples are labeled, and we train an initial classifier for every individual using these
samples. As the framework matures, samples manually annotated by the AL and
pseudo-labeled by the SPL increase, and we adopt them to retrain the classifiers.
High-confidence sample pseudo-labeling: We rank the unlabeled samples by their importance weights via the current classifiers, e.g., using the classification hinge loss, and then assign pseudo-labels to the top-ranked high-confidence samples.
This step can be automatically implemented by our system.
Low-confidence sample annotation: Based on certain AL criteria obtained under
the current classifiers, all unlabeled samples are ranked; then, the top-ranked samples
(most informative and generally with low confidence) are selected from the unlabeled
samples and manually annotated by active users.
CNN fine-tuning: After several steps of interaction, we fine-tune the neural network by the backpropagation algorithm. All samples self-labeled by the SPL and manually annotated by the AL are fed into the network, and we utilize the softmax loss to optimize the CNN parameters via stochastic gradient descent.
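The five stages above can be sketched as a single loop. The toy run below substitutes a nearest-centroid classifier for the one-versus-all SVMs and lets the ground-truth labels play the active user; every threshold and name here is invented, and it illustrates the control flow only, not the authors' implementation.

```python
# Toy end-to-end pass of the pipeline on 2-D points.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
true_y = np.array([0] * 20 + [1] * 20)

labeled = {0: 0, 20: 1}                      # initialization: one seed per class
for step in range(5):
    # classifier updating: centroids from the current (pseudo-)labeled set
    cents = np.array([X[[i for i, c in labeled.items() if c == k]].mean(axis=0)
                      for k in (0, 1)])
    d = np.linalg.norm(X[:, None] - cents[None], axis=2)
    pred, conf = d.argmin(1), np.abs(d[:, 0] - d[:, 1])
    unl = [i for i in range(len(X)) if i not in labeled]
    # SPL stage: pseudo-label the most confident unlabeled samples
    for i in sorted(unl, key=lambda i: -conf[i])[:5]:
        labeled[i] = int(pred[i])
    # AL stage: query the "oracle" on the least confident unlabeled sample
    i = min(unl, key=lambda i: conf[i])
    labeled[i] = int(true_y[i])

acc = np.mean([pred[i] == true_y[i] for i in range(len(X))])
print(acc)  # the two well-separated clusters are fully recovered
```

The CNN fine-tuning stage, which would periodically refresh the feature space, is omitted here for brevity.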
9.4 Formulation and Optimization

In this section, we discuss the formulation of our proposed framework and provide a theoretical interpretation of the entire pipeline from the perspective of optimization. Specifically, the entire pipeline can be justified theoretically because it closely accords with the solving process of an active self-paced learning (ASPL) optimization model. Such a theoretical analysis delivers a more insightful understanding of the intrinsic mechanism underlying the ASPL system.
In the context of face identification, suppose that we have n facial photos of m subjects. Denote the training samples as $D = \{x_i\}_{i=1}^n \subset \mathbb{R}^d$, where $x_i$ is the d-dimensional feature representation of the ith sample. We have m classifiers for recognizing each sample by the one-versus-all strategy.
Knowledge learned from the data will be utilized to ameliorate our model after a period of pace increase. Correspondingly, we denote the label set of $x_i$ as $y_i = \{y_i^{(j)} \in \{-1, 1\}\}_{j=1}^m$, where $y_i^{(j)}$ corresponds to the label of $x_i$ for the jth subject. That is, if $y_i^{(j)} = 1$, then $x_i$ is categorized as a face from the jth subject.
122 9 Face Verification
In our problem setting, we should make two necessary remarks. First, in our investigated face identification problems, almost no data are labeled before running our system; only a very small number of samples are annotated as the initialization. That is, most $\{y_i\}_{i=1}^n$ are unknown and need to be completed in the learning process. In our system, a minority of the samples are manually annotated by active users, and a majority are pseudo-labeled in a self-paced manner. Second, the data $\{x_i\}_{i=1}^n$ could possibly be input into the system incrementally, meaning that the data scale might be consistently growing.
Via the proposed mechanism of combining SPL and AL, our proposed ASPL model can adaptively address both manually annotated and pseudo-labeled samples and still progressively fit the consistently growing unlabeled data incrementally. The ASPL is formulated as follows:

$$\min_{\{\mathbf{w},\,\mathbf{b},\,\mathbf{v},\; y_i\in\{-1,1\}^m,\, i\notin\Omega_\lambda\}}\ \sum_{j=1}^{m}\Big[\frac{1}{2}\|\mathbf{w}^{(j)}\|_2^2 + C\cdot L\big(\mathbf{w}^{(j)}, b^{(j)}, D, \mathbf{y}^{(j)}, \mathbf{v}^{(j)}\big) + f\big(\mathbf{v}^{(j)}; \lambda_j\big)\Big], \quad \text{s.t.}\ \mathbf{v}\in\Psi_\lambda, \tag{9.2}$$
where $\mathbf{w} = \{\mathbf{w}^{(j)}\}_{j=1}^m \subset \mathbb{R}^d$ and $\mathbf{b} = \{b^{(j)}\}_{j=1}^m \subset \mathbb{R}$ represent the weight and bias parameters of the decision functions for all m classifiers. $C$ ($C > 0$) is the standard regularization parameter trading off the loss function and the margin, and we set $C = 1$ in our experiments. $\mathbf{v} = \{[v_1^{(j)}, v_2^{(j)}, \ldots, v_n^{(j)}]^T\}_{j=1}^m$ denotes the weight variables reflecting the training samples' importance, and $\lambda_j$ is a parameter (i.e., the pace age) for controlling the learning pace of the jth classifier. $f(\mathbf{v}^{(j)}; \lambda_j)$ is the self-paced regularizer controlling the learning scheme. We denote the index collection of all currently active annotated samples as $\Omega_\lambda = \cup_{j=1}^m \{\Omega_{\lambda_j}\}$, where $\Omega_{\lambda_j}$ corresponds to the set of the jth subject with the pace age $\lambda_j$. Here, $\Omega_\lambda$ is introduced as a constraint on $y_i$. $\Psi_\lambda = \cap_{i=1}^n \{\Psi_i^\lambda\}$ composes the curriculum constraint of the model at the m classifier pace ages $\lambda = \{\lambda_j\}_{j=1}^m$. In particular, we specify two alternative types of curriculum constraint for each sample $x_i$, as
• $\Psi_i^\lambda = [0, 1]$ for a pseudo-labeled sample, i.e., $i \notin \Omega_\lambda$. Then, its weights with respect to all the classifiers $\{v_i^{(j)}\}_{j=1}^m$ need to be learned in the SPL optimization.
• $\Psi_i^\lambda = \{1\}$ for a sample annotated by the AL process, i.e., $\exists j$ s.t. $i \in \Omega_{\lambda_j}$. Thus, its weights are deterministically set during the model training, i.e., $v_i^{(j)} = 1$.
Each type of curriculum will be interpreted in detail in Sect. 9.2. Note that in contrast to the previous SPL settings, this curriculum $\Psi_i^\lambda$ can be dynamically changed with respect to all the pace ages $\lambda$ of the m classifiers. This confirms the superiority of our model, as we discuss at the end of this section.
We then define the loss function $L(\mathbf{w}^{(j)}, b^{(j)}, D, \mathbf{y}^{(j)}, \mathbf{v}^{(j)})$ on $\mathbf{x}$ as

$$L\big(\mathbf{w}^{(j)}, b^{(j)}, D, \mathbf{y}^{(j)}, \mathbf{v}^{(j)}\big) = \sum_{i=1}^{n} v_i^{(j)}\, l\big(\mathbf{w}^{(j)}, b^{(j)}; x_i, y_i^{(j)}\big) = \sum_{i=1}^{n} v_i^{(j)}\Big[1 - y_i^{(j)}\big(\mathbf{w}^{(j)T} x_i + b^{(j)}\big)\Big]_+, \quad \text{s.t.}\ \sum_{j=1}^{m}\big|y_i^{(j)} + 1\big| \le 2,\ \ y_i^{(j)} \in \{-1, 1\},\ i \notin \Omega_\lambda, \tag{9.3}$$

where $[1 - y_i^{(j)}(\mathbf{w}^{(j)T} x_i + b^{(j)})]_+$ is the hinge loss of $x_i$ in the jth classifier. The
cost term corresponds to the summed loss of all classifiers, and the constraint term allows only two types of feasible solutions: (i) for some $j$, $y_i^{(j)} = 1$, while $y_i^{(k)} = -1$ for all $k \ne j$; (ii) $y_i^{(j)} = -1$ for all $j = 1, 2, \ldots, m$ (i.e., background or an unknown person class). Such samples $x_i$ are added to the unknown sample set $U$. Clearly, this constraint complies with real-life cases in which a sample should be categorized into one prespecified subject or not classified into any of the current subjects.
This optimization problem can be solved by the well-known alternating search strategy. Specifically, the algorithm alternately updates the classifier parameters $\mathbf{w}, \mathbf{b}$ via one-versus-all SVM, the sample importance weights $\mathbf{v}$ via the SPL, and the pseudo-labels $\mathbf{y}$ via reranking. In addition to gradually increasing the pace parameter $\lambda$, the optimization updates (i) the curriculum constraint $\Psi_\lambda$ via AL and (ii) the feature representation via fine-tuning the CNN. In the following section, we introduce the details of these optimization steps and their physical interpretations. The correspondence of this algorithm to the practical implementation of the ASPL system will also be discussed at the end.
Initialization: As introduced in the framework, we start running our system by using a pretrained CNN to extract feature representations of all samples $\{x_i\}_{i=1}^n$. We set an initial pace parameter $\lambda = \{\lambda_j\}_{j=1}^m$ for the m classifiers and initialize the curriculum constraint $\Psi_\lambda$ with the currently user-annotated samples $\Omega_\lambda$ and the corresponding $\{\mathbf{y}^{(j)}\}_{j=1}^m$ and $\mathbf{v}$.
Classifier Updating: This step aims to update the classifier parameters $\{\mathbf{w}^{(j)}, b^{(j)}\}_{j=1}^m$ by one-versus-all SVM. With $\{\{x_i\}_{i=1}^n, \mathbf{v}, \{y_i\}_{i=1}^n, \Psi_\lambda\}$ fixed, the original ASPL model in Eq. (9.2) can be simplified to the following form:

$$\min_{\mathbf{w},\,\mathbf{b}}\ \sum_{j=1}^{m}\Big[\frac{1}{2}\|\mathbf{w}^{(j)}\|_2^2 + C\sum_{i=1}^{n} v_i^{(j)}\, l\big(\mathbf{w}^{(j)}, b^{(j)}; x_i, y_i^{(j)}\big)\Big],$$

which decomposes into m independent subproblems, one per classifier:

$$\min_{\mathbf{w}^{(j)},\, b^{(j)}}\ \frac{1}{2}\|\mathbf{w}^{(j)}\|_2^2 + C\sum_{i=1}^{n} v_i^{(j)}\, l\big(\mathbf{w}^{(j)}, b^{(j)}; x_i, y_i^{(j)}\big). \tag{9.4}$$
This is a standard one-versus-all SVM model with weights, which takes one class of samples as positive and all others as negative. Specifically, when the weights $v_i^{(j)}$ take only the values $\{0, 1\}$, the model corresponds to a simplified SVM trained on the sampled instances with $v_i^{(j)} = 1$; otherwise, when $v_i^{(j)}$ takes values from $[0, 1]$, it corresponds to a weighted SVM model. Both models can readily be solved by many off-the-shelf efficient solvers. Thus, this step can be interpreted as implementing one-versus-all SVM over manually annotated instances from the AL and self-annotated instances from the SPL.
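As a concrete but hypothetical instance of the per-class subproblem in Eq. (9.4), the sketch below solves the weighted hinge objective by plain subgradient descent on invented 2-D data; a production system would call an off-the-shelf SVM solver, as the text suggests, and `weighted_svm` is not the authors' API.

```python
# Weighted linear SVM for one class: min_w,b 0.5*||w||^2 +
# C * sum_i v_i * max(0, 1 - y_i*(w.x_i + b)), solved by subgradient descent.
import numpy as np

def weighted_svm(X, y, v, C=1.0, lr=0.01, epochs=1000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margin = y * (X @ w + b)
        active = (margin < 1).astype(float) * v   # weighted hinge subgradient mask
        gw = w - C * (active * y) @ X             # regularizer + weighted hinge
        gb = -C * (active * y).sum()
        w, b = w - lr * gw, b - lr * gb
    return w, b

X = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.9, 3.1]])
y = np.array([-1, -1, 1, 1])
v = np.array([1.0, 1.0, 1.0, 0.5])    # v_i = sample importance from the SPL
w, b = weighted_svm(X, y, v)
print(np.sign(X @ w + b))             # the separable toy set is classified correctly
```

Setting all $v_i \in \{0,1\}$ reduces this to the simplified SVM described above.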
High-confidence Sample Labeling: This step aims to assign pseudo-labels $\mathbf{y}$ and the corresponding importance weights $\mathbf{v}$ to the top-ranked high-confidence samples. We start by employing the SPL to rank the unlabeled samples according to their weights $\mathbf{v}$. With $\{\mathbf{w}, \mathbf{b}, \{x_i\}_{i=1}^n, \{y_i\}_{i=1}^n, \Psi_\lambda\}$ fixed, our ASPL model in Eq. (9.2) can be simplified to optimize $\mathbf{v}$ as

$$\min_{\mathbf{v}\in[0,1]^{n\times m}}\ \sum_{j=1}^{m}\Big[C\sum_{i=1}^{n} v_i^{(j)}\, l\big(\mathbf{w}^{(j)}, b^{(j)}; x_i, y_i^{(j)}\big) + f\big(\mathbf{v}^{(j)}; \lambda_j\big)\Big], \quad \text{s.t.}\ \mathbf{v}\in\Psi_\lambda. \tag{9.5}$$
The problem then degenerates to a standard SPL problem, as in Eq. (9.1). Because both the self-paced regularizer $f(\mathbf{v}^{(j)}; \lambda_j)$ and the curriculum constraint $\Psi_\lambda$ are convex (with respect to $\mathbf{v}$), various existing convex optimization techniques, such as gradient-based or interior-point methods, can be used to solve it. Note that we have multiple choices for the self-paced regularizer, such as those constructed in [16, 17]. All of them comply with the three axiomatic conditions required for a self-paced regularizer, as defined in Sect. 9.2.
Based on the second axiomatic condition for a self-paced regularizer, any of the above $f(\mathbf{v}^{(j)}; \lambda_j)$ tends to assign larger weights to high-confidence (i.e., easy) samples with smaller loss values and vice versa, which evidently equips the model with the "learning from easy to hard" insight. In all our experiments, we utilize the linear soft weighting regularizer due to its relatively easy implementation and good adaptability to complex scenarios. This regularizer penalizes the sample weights linearly in terms of the loss. Specifically, we have
$$f\big(\mathbf{v}^{(j)}; \lambda_j\big) = \lambda_j\Big(\frac{1}{2}\|\mathbf{v}^{(j)}\|_2^2 - \sum_{i=1}^{n} v_i^{(j)}\Big), \tag{9.6}$$

where $\lambda_j > 0$. Equation (9.6) is convex with respect to $\mathbf{v}^{(j)}$, and we can therefore find its global optimum by setting the partial gradient to zero. Considering $v_i^{(j)} \in [0, 1]$, we deduce the analytical solution for the linear soft weighting as follows:
$$v_i^{(j)} = \begin{cases} -\dfrac{C\,\ell_{ij}}{\lambda_j} + 1, & C\,\ell_{ij} < \lambda_j \\[4pt] 0, & \text{otherwise}, \end{cases} \tag{9.7}$$

where $\ell_{ij} = l(\mathbf{w}^{(j)}, b^{(j)}; x_i, y_i^{(j)})$ is the loss of $x_i$ in the jth classifier. Note that the way to deduce Eq. (9.7) is similar to that used in [16], but the resulting solution is different because our ASPL model in Eq. (9.2) is new.
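The closed-form update of Eq. (9.7) is straightforward to implement; the loss values below are arbitrary illustrative numbers, and `soft_weights` is a hypothetical helper name.

```python
# Linear soft weighting, Eq. (9.7): v_i = 1 - C*l_i/lam if C*l_i < lam else 0.
import numpy as np

def soft_weights(losses, lam, C=1.0):
    v = 1.0 - C * losses / lam
    return np.where(C * losses < lam, v, 0.0)

losses = np.array([0.0, 0.2, 0.5, 2.0])
print(soft_weights(losses, lam=1.0))   # easy samples (small loss) get large weights
```

As the pace parameter $\lambda_j$ grows across iterations, harder samples receive nonzero weights and enter training.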
After obtaining the weights for all unlabeled samples ($i \notin \Omega_\lambda$), we sort them by the optimized $\mathbf{v}^{(j)}$ in descending order and regard the samples with larger weights as high-confidence samples. We form these samples into a high-confidence sample set $S$ and assign them pseudo-labels: fixing $\{\mathbf{w}, \mathbf{b}, \{x_i\}_{i=1}^n, \Psi_\lambda, \mathbf{v}\}$, we optimize $y_i$ in Eq. (9.2), which corresponds to solving
$$\min_{y_i\in\{-1,1\}^m,\, i\in S}\ \sum_{i=1}^{n}\sum_{j=1}^{m} v_i^{(j)}\,\ell_{ij}, \quad \text{s.t.}\ \sum_{j=1}^{m}\big|y_i^{(j)} + 1\big| \le 2, \tag{9.8}$$
where $v_i$ is fixed and can be treated as a constant. When $x_i$ belongs to a certain person class, Eq. (9.8) has an optimum that can be obtained exactly by Theorem 1, whose proof is specified in the supplementary material.
We denote by $M$ the set of indices $j$ that satisfy $\mathbf{w}^{(j)T} x_i + b^{(j)} \ne 0$ and $v_i^{(j)} \in (0, 1]$, and set $y_i^{(j)} = -1$ for all others by default. The solution of Eq. (9.8) for $y_i^{(j)}$, $j \in M$, can be obtained by the following theorem.
(b) When $\mathbf{w}^{(j)T} x_i + b^{(j)} < 0$ for all $j \in M$ except $j = j^*$, i.e., $v_i^{(j^*)}\,\ell_{ij^*} > 0$, then Eq. (9.8) has a solution:

$$y_i^{(j)} = \begin{cases} -1, & j \ne j^* \\ 1, & j = j^*, \end{cases}$$

where

$$j^* = \arg\min_{1\le j\le m}\ v_i^{(j)}\Big(\ell_{ij} - \big[1 + (\mathbf{w}^{(j)T} x_i + b^{(j)})\big]_+\Big). \tag{9.9}$$
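The pseudo-labeling rule implied by Eq. (9.8) can be sketched as follows: each high-confidence sample receives at most one positive label (the constraint $\sum_j |y_i^{(j)} + 1| \le 2$). For simplicity this sketch picks the positive class by a plain argmax over decision values rather than the exact criterion of Eq. (9.9), so it is an approximation, not the authors' rule; scores are invented.

```python
# Simplified pseudo-labeling: at most one +1 per sample, or all -1 (unknown).
import numpy as np

def pseudo_label(scores):
    """scores[j] = w_j . x + b_j. Returns y in {-1, 1}^m with at most one +1."""
    y = -np.ones_like(scores)
    j_star = int(np.argmax(scores))
    if scores[j_star] > 0:          # otherwise: background/unknown class, all -1
        y[j_star] = 1.0
    return y

print(pseudo_label(np.array([-0.7, 1.4, 0.3])))    # -> [-1.  1. -1.]
print(pseudo_label(np.array([-0.7, -1.4, -0.3])))  # -> [-1. -1. -1.]
```

The second case corresponds to samples routed to the unknown sample set $U$.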
sample, our AL process performs the following two operations: (i) set its curriculum constraint, i.e., $\{\Psi_i^\lambda\}_{i\in\phi} = \{1\}$, and (ii) update its labels $\{y_i\}_{i\in\phi}$ and add its index to the set of currently annotated samples $\Omega_\lambda$. Such a specified curriculum still complies with the axiomatic conditions for the curriculum constraint defined in [15]. For the annotated samples, the corresponding $\Psi_i^\lambda = \{1\}$ with expectation value 1 over the whole set, while for the others, $\Psi_i^\lambda = [0, 1]$ with expectation value 1/2. Thus, the more informative samples still have a larger expectation than the others. Also, $\Psi_\lambda$ is clearly nonempty and convex. It therefore complies with the traditional curriculum understanding.
New Class Handling: After the AL process, if the active user annotates the selected unlabeled samples with u unseen person classes, then new classifiers for these unseen classes need to be initialized without affecting the existing classifiers. Moreover, there is a further difficulty in that the samples of a new class are not sufficient for classifier training. Owing to the proposed ASPL framework, we employ the following steps to address the abovementioned issues.
(1) For each of the new class samples, search all the unlabeled samples and pick
out its K -nearest neighbors from the unseen class set U in the feature space;
(2) Require the active user to annotate these selected neighbors to enrich the
positive samples for the new person classes; and
(3) Initialize and update $\{\mathbf{w}^{(j)}, b^{(j)}, \mathbf{v}^{(j)}, \mathbf{y}^{(j)}, \lambda_j\}_{j=m+1}^{m+u}$ for these new person classes according to the abovementioned iteration process of {initialization, classifier updating, high-confidence sample labeling, and low-confidence sample annotating}.
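Step (1) above can be sketched as a brute-force nearest-neighbor search in feature space; the feature vectors below are random stand-ins for CNN features, and `knn_candidates` is a hypothetical helper, not the authors' API.

```python
# Pick the K nearest unlabeled neighbors of a new-class sample (step 1),
# to be shown to the active user for annotation (step 2).
import numpy as np

def knn_candidates(new_sample, unlabeled_feats, k=3):
    d = np.linalg.norm(unlabeled_feats - new_sample, axis=1)
    return np.argsort(d)[:k]

rng = np.random.default_rng(1)
feats = rng.normal(size=(100, 8))          # stand-in CNN features
idx = knn_candidates(feats[0] + 0.01, feats, k=3)
print(idx[0])  # the query's near-duplicate, sample 0, is nearest
```

For large unlabeled pools, an approximate nearest-neighbor index would replace the brute-force distance computation.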
This step corresponds to the instructor's role in human education, which aims to guide a student toward a more informative curriculum. In contrast to the fixed curriculum setting used throughout the learning process in previous SPL work, here the curriculum is dynamically updated based on the knowledge the model has learned in a self-paced way. Such an improvement better simulates the general learning process of a good student: as the student's knowledge grows, the instructor should impose fewer curriculum constraints, from more in the early stage to less in the later stage. This learning method evidently achieves a better learning effect because it adapts to the student's personal progress.
Feature Representation Updating: After several SPL and AL updating iterations of $\{\mathbf{w}, \mathbf{b}, \{y_i\}_{i=1}^n, \mathbf{v}, \Psi_\lambda\}$, we update the feature representations $\{x_i\}_{i=1}^n$ by fine-tuning the pretrained CNN on all manually labeled samples from the AL and self-annotated samples from the SPL. These samples deliver data knowledge to the network and improve the representation of the training samples. A better feature representation is therefore expected to be extracted from the fine-tuned CNN.
This learning process simulates the updating of the knowledge structure of a
human brain after a period of domain learning. Such updating tends to facilitate a
person’s ability to grasp more effective features to represent newly emerging samples
from certain domains and enables him/her to perform better as a learner. In our
experiments, we generally fine-tune the CNN after approximately 50 rounds of SPL
and AL updating, and the learning rate is set as 0.001 for all layers.
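A minimal sketch of the softmax-loss gradient-descent objective used in fine-tuning, reduced here to a single linear layer on fixed features with the learning rate 0.001 mentioned above; the actual system updates all CNN layers by backpropagation, and the data below are synthetic.

```python
# Softmax classifier trained by full-batch gradient descent on fixed features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))                    # stand-in "CNN features"
y = rng.integers(0, 4, size=64)                  # labels from SPL + AL
W = np.zeros((16, 4))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

lr = 0.001
for _ in range(200):
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1.0               # d(cross-entropy)/d(logits)
    W -= lr * X.T @ p / len(y)

loss = -np.log(softmax(X @ W)[np.arange(len(y)), y]).mean()
print(loss < np.log(4))  # True: below the uniform-prediction loss log(4)
```

In the full system this update runs over all CNN parameters via stochastic gradient descent rather than over one linear layer.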
Pace Parameter Updating: We specify a pace parameter $\lambda_j$ for each individual classifier and utilize a heuristic strategy to update $\{\lambda_j\}_{j=1}^m$ in our implementation. For the tth iteration of the ASPL, we compute the pace parameter for optimizing Eq. (9.2) by

$$\lambda_j^t = \begin{cases} \lambda_0, & t = 0 \\ \lambda_j^{(t-1)} + \alpha\,\eta_j^t, & 1 \le t \le \tau \\ \lambda_j^{(t-1)}, & t > \tau, \end{cases} \tag{9.10}$$
where $\eta_j^t$ is the average accuracy of the jth classifier in the current iteration and $\alpha$ is a parameter that controls the pace increase rate. In our experiments, we empirically set $\{\lambda_0, \alpha\} = \{0.2, 0.08\}$. Note that updating the pace parameters $\lambda$ should stop once all training samples have weight $v = 1$. Thus, we introduce an empirical threshold $\tau$ such that $\lambda$ is updated only in early iterations, i.e., $t \le \tau$; $\tau$ is set to 12 in our experiments.
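The schedule of Eq. (9.10) in code, with an invented per-iteration accuracy sequence; the defaults $\{\lambda_0, \alpha, \tau\} = \{0.2, 0.08, 12\}$ follow the text, and `pace` is a hypothetical helper name.

```python
# Heuristic pace schedule of Eq. (9.10): grow by alpha * accuracy until tau,
# then freeze.
def pace(t, lam_prev, eta, lam0=0.2, alpha=0.08, tau=12):
    if t == 0:
        return lam0
    if t <= tau:
        return lam_prev + alpha * eta
    return lam_prev                  # frozen once t exceeds the threshold tau

lam, etas = pace(0, None, None), [0.5, 0.6, 0.7]
for t, eta in enumerate(etas, start=1):
    lam = pace(t, lam, eta)
print(round(lam, 3))   # 0.2 + 0.08*(0.5 + 0.6 + 0.7) = 0.344
```

Tying the increment to the classifier's current accuracy lets better-trained classifiers admit harder samples faster.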
The entire algorithm is summarized in Algorithm 1. It is easy to see that this solving strategy for the ASPL model accords well with the pipeline of our framework.
Convergence Discussion: As illustrated in Algorithm 1, the ASPL algorithm alternately updates the variables, including the classifier parameters $\mathbf{w}, \mathbf{b}$ (by weighted SVM), the pseudo-labels $\mathbf{y}$ (closed-form solution by Theorem 1), the weights $\mathbf{v}$ (by SPL), and the low-confidence sample annotations $\phi$ (by AL). Each of the first three updates is the global optimum of a subproblem of the original model; thus, the decrease of the objective function is guaranteed. However, as in other existing AL techniques, human effort is involved in the loop of the AL stage, so a monotonic decrease of the objective function cannot be guaranteed in this step. As the learning proceeds, the model tends to become increasingly mature, and the AL labor lessens in the later learning stages. Thus, with gradually less involvement of the AL step, the monotonic decrease of the objective function through iterations tends to be preserved, and our algorithm therefore tends to converge.
References
1. L. Lin, K. Wang, D. Meng, W. Zuo, L. Zhang, Active self-paced learning for cost-effective
and progressive face identification, in IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 40, no. 1, pp. 7–19, 1 Jan. 2018
2. F. Celli, E. Bruni, B. Lepri, Automatic personality and interaction style recognition from
facebook profile pictures, in ACM Conference on Multimedia (2014)
3. Z. Stone, T. Zickler, T. Darrell, Toward large-scale face recognition using social network
context. Proc. IEEE 98, (2010)
4. Z. Lei, D. Yi, S.Z. Li, Learning stacked image descriptor for face recognition. IEEE Trans. Circuits Syst. Video Technol. PP(99), 1–1 (2015)
5. S. Liao, A.K. Jain, S.Z. Li, Partial face recognition: alignment-free approach. IEEE Transactions
on Pattern Analysis and Machine Intelligence 35(5), 1193–1205 (2013)
6. D. Yi, Z. Lei, S. Z. Li, Towards pose robust face recognition, in Computer Vision and Pattern
Recognition (CVPR), 2013 IEEE Conference on, pp. 3539–3545 (2013)
7. X. Zhu, Z. Lei, J. Yan, D. Yi, S.Z. Li, High-fidelity pose and expression normalization for face
recognition in the wild, in 2015 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 787–796 (2015)
8. Y. Sun, X. Wang, X. Tang, Hybrid deep learning for face verification, in Proceedings of IEEE International Conference on Computer Vision (2013)
9. X. Wang, X. Guo, S. Z. Li, Adaptively unified semi-supervised dictionary learning with active
points, in 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1787–1795
(2015)
10. Y.-F. Li, Z.-H. Zhou, Towards making unlabeled data never hurt. IEEE Trans. Pattern Anal. Mach. Intell. 37(1), 175–188 (2015)
11. H. Zhao et al., A novel incremental principal component analysis and its application for face recognition. IEEE Trans. Syst. Man Cybern. (2006)
12. T.-K. Kim, K.-Y. Kenneth Wong, B. Stenger, J. Kittler, R. Cipolla, Incremental linear discrimi-
nant analysis using sufficient spanning set approximations, in Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition (2007)
13. E. Ehsan, S. Guillermo, Y. Allen, S.S. Shankar, A convex optimization framework for active
learning, in Proceedings of IEEE International Conference on Computer Vision (2013)
14. K. Wang, D. Zhang, Y. Li, R. Zhang, L. Lin, Cost-effective active learning for deep image
classification. IEEE Trans. Circuits Syst. Video Technol. PP(99), 1–1 (2016)
15. L. Jiang, D. Meng, Q. Zhao, S. Shan, A.G. Hauptmann, Self-paced curriculum learning. Pro-
ceedings of AAAI Conference on Artificial Intelligence (2015)
16. L. Jiang, D. Meng, T. Mitamura, A.G. Hauptmann, Easy samples first: self-paced reranking for zero-example multimedia search, in ACM Conference on Multimedia (2014)
17. L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, A. Hauptmann, Self-paced learning with diversity,
in Proceedings of Advances in Neural Information Processing Systems (2014)
18. Y. Bengio, J. Louradour, R. Collobert, J. Weston, Curriculum learning, in Proceedings of IEEE
International Conference on Machine Learning (2009)
19. M.P. Kumar et al., Self-paced learning for latent variable models, in Proceedings of Advances in Neural Information Processing Systems (2010)
20. G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S.Z. Li, T. Hospedales, When face recognition
meets with deep learning: an evaluation of convolutional neural networks for face recognition,
in The IEEE International Conference on Computer Vision (ICCV) Workshops (2015)
21. Y. LeCun, K. Kavukcuoglu, C. Farabet, Convolutional networks and applications in vision, in
ISCAS (2010)
22. K. Wang, L. Lin, W. Zuo, S. Gu, L. Zhang, Dictionary pair classifier driven convolutional
neural networks for object detection, in 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 2138–2146, June 2016
23. L.I. Smith, A tutorial on principal components analysis. Cornell University, USA 51, 52 (2002)
24. M. Karasuyama, I. Takeuchi, Multiple incremental decremental learning of support vector
machines, in Proceedings of Advances in Neural Information Processing Systems (2009)
25. N.-Y. Liang et al., A fast and accurate online sequential learning algorithm for feedforward networks. IEEE Trans. Neural Netw. (2006)
26. S. Ozawa et al., Incremental learning of feature space and classifier for face recognition. Neural
Networks 18, (2005)
27. D.D. Lewis, W.A. Gale, A sequential algorithm for training text classifiers, in ACM SIGIR
Conference (1994)
28. S. Tong, D. Koller, Support vector machine active learning with applications to text classifica-
tion. J. Mach. Learn. Res. 2, (2002)
29. A.K. McCallumzy, K. Nigamy, Employing em and pool-based active learning for text classi-
fication, in Proceedings of IEEE International Conference on Machine Learning (1998)
30. A.J. Joshi, F. Porikli, N. Papanikolopoulos, Multi-class active learning for image classification,
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2009)
31. A. Kapoor, G. Hua, A. Akbarzadeh, S. Baker, Which faces to tag: Adding prior constraints into
active learning, in Proceedings of IEEE International Conference on Computer Vision (2009)
32. A. Kapoor, K. Grauman, R. Urtasun, T. Darrell, Active learning with gaussian processes for
object categorization, in Proceedings of IEEE International Conference on Computer Vision
(2007)
33. X. Li, Y. Guo, Adaptive active learning for image classification, in Proceedings of IEEE Con-
ference on Computer Vision and Pattern Recognition (2013)
34. K. Brinker, Incorporating diversity in active learning with support vector machines, in Pro-
ceedings of IEEE International Conference on Machine Learning (2003)
35. Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, A.G. Hauptmann, Self-paced learning for matrix
factorization, in Proceedings of AAAI Conference on Artificial Intelligence (2015)
36. M.P. Kumar, H. Turki, D. Preston, D. Koller, Learning specific-class segmentation from diverse
data, in Proceedings of IEEE International Conference on Computer Vision (2011)
37. Y.J. Lee, K. Grauman, Learning the easy things first: Self-paced visual category discovery, in
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2011)
38. J.S. Supancic, D. Ramanan, Self-paced learning for long-term tracking, in Proceedings of IEEE
Conference on Computer Vision and Pattern Recognition (2013)
39. S. Yu et al., CMU-Informedia @ TRECVID 2014 multimedia event detection, in TRECVID Video Retrieval Evaluation Workshop (2014)
40. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional
neural networks. Advances in Neural Information Processing Systems 25, 1097–1105 (2012)
41. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recogni-
tion, in ICLR (2015)
Part V
Higher Level Tasks
In computer vision, increasing attention has been paid to understanding human activ-
ity to determine what people are doing in a given video in different application
domains, e.g., intelligent surveillance, robotics, and human–computer interaction.
Recently developed 3D/depth sensors have opened new opportunities with enor-
mous commercial value by providing richer information (e.g., extra depth data of
scenes and objects) than traditional cameras. By building upon this enriched infor-
mation, human poses can be estimated more easily. However, modeling complicated
human activities remains challenging.
Many works on human action/activity recognition focus mainly on designing
robust and descriptive features [1, 2]. For example, Xia and Aggarwal [1] extract spatiotemporal interest points from depth videos (DSTIP) and develop a depth cuboid similarity feature (DCSF) to model human activities. Oreifej and Liu [2] propose capturing spatiotemporal changes in activities by using a histogram of oriented 4D surface normals (HON4D). Most of these methods, however, overlook detailed spatiotemporal structural information and are limited to periodic activities.
Several compositional part-based approaches that have been studied for complex
scenarios have achieved substantial progress [3, 4]; they represent an activity with
deformable parts and contextual relations. For instance, Wang et al. [3] recognized
human activities in common videos by training the hidden conditional random fields
in a max-margin framework. For activity recognition in RGB-D data, Packer et al.
[5] employed the latent structural SVM to train the model with part-based pose
trajectories and object manipulations. An ensemble model of actionlets was studied
in [4] to represent 3D human activities with a new feature called the local occupancy
pattern (LOP). To address more complicated activities with large temporal variations,
some improved models discover the temporal structures of activities by localizing
sequential actions. For example, Wang and Wu [6] propose solving the temporal
alignment of actions by maximum margin temporal warping. Tang et al. [7] capture
the latent temporal structures of 2D activities based on the variable-duration hidden
Markov model. Koppula and Saxena [8] apply conditional random fields to model
the subactivities and affordances of the objects for 3D activity recognition.
In the depth video scenario, Packer et al. [5] address action recognition by modeling both pose trajectories and object manipulations with a latent structural SVM. Wang et al. [4] develop an actionlet ensemble model and a novel feature called the local occupancy pattern (LOP).
References
10.1 Introduction
Fig. 10.1 Two activities of the same category. We consider one activity as a sequence of actions
that occur over time; the temporal composition of an action may differ for different subjects
segment of a flexible length. Our model is inspired by the effectiveness of two widely
successful techniques: deep learning [8–13] and latent structure models [14–18]. One
example of the former is the convolutional neural network (CNN), which was recently
applied to generate powerful features for video classification [13, 19]. On the other
hand, latent structure models (such as the deformable part-based model [15]) have
been demonstrated to be an effective class of models for managing large object variations for recognition and detection. One of the key components of these models is
the reconfigurable flexibility of the model structure, which is often implemented by
estimating latent variables during inference.
We adopt the deep CNN architecture [8, 13] to layer-wise extract features from
the input video data, and the architecture is vertically decomposed into several subnetworks corresponding to the video segments, as Fig. 10.2 illustrates. In particular,
our model searches for the optimal composition of each activity instance during
recognition, which is the key to managing temporal variation in human activities.
Moreover, we introduce a relaxed radius-margin bound into our deep model, which
effectively improves the generalization performance for classification.
10.2 Deep Structured Model

In this section, we introduce the main components of our deep structured model, including the spatiotemporal CNNs, the latent structure of activity decomposition, and the radius-margin bound for classification.
Fig. 10.2 The architecture of spatiotemporal convolutional neural networks. The neural networks
are stacked convolutional layers, max-pooling operators, and a fully connected layer, where the raw
segmented videos are treated as the input. A subnetwork is referred to as a vertically decomposed
subpart with several stacked layers that extracts features for one segmented video section (i.e.,
one subactivity). Moreover, by using the latent variables, our architecture is capable of explicitly
handling diverse temporal compositions of complex activities
138 10 Human Activity Understanding
Fig. 10.3 Illustration of incorporating the latent structure into the deep model. Different subnet-
works are denoted by different colors
during inference and learning. Assume that the input video is temporally divided into
a number M of segments corresponding to the subactivities. We index each video
segment by its starting anchor frame s_j and its temporal length (i.e., the number of frames) t_j, which is at least m, the number of frames fed to one subnetwork, i.e., t_j ≥ m. To address the large temporal variation in human activities, we make s_j and t_j variables.
we denote the indexes of the starting anchor frames as (s1 , ..., s M ) and their tem-
poral lengths as (t1 , ..., t M ); these are regarded as the latent variables in our model,
h = (s1 , ..., s M , t1 , ..., t M ). These latent variables specifying the segmentation will
be adaptively estimated for different input videos.
We associate the CNNs with the video segmentation by feeding each segmented
part into a subnetwork, as Fig. 10.2 illustrates. Next, according to the method of
video segmentation (i.e., decomposition of subactivities), we manipulate the CNNs
by inputting the sampled video frames. Specifically, each subnetwork takes m video frames as the input; if a segment offers more than m frames according to the latent variables (i.e., t_j > m), uniform sampling is performed to extract m key frames.
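This sampling step can be sketched as follows. The helper below is our own illustration (the names `sample_key_frames`, `s_j`, and `t_j` are not from the chapter); it uniformly picks m key-frame indices from a segment described by its latent variables:

```python
import numpy as np

def sample_key_frames(s_j, t_j, m):
    """Uniformly sample m key-frame indices from the segment that starts
    at anchor frame s_j and spans t_j frames (requires t_j >= m)."""
    if t_j < m:
        raise ValueError("segment must offer at least m frames")
    # m evenly spaced offsets across the segment, rounded to frame indices
    offsets = np.round(np.linspace(0, t_j - 1, num=m)).astype(int)
    return s_j + offsets

# e.g., a segment starting at frame 10 with 25 frames, m = 9 key frames
print(sample_key_frames(10, 25, 9))  # -> [10 13 16 19 22 25 28 31 34]
```

When t_j = m, the sampling degenerates to taking every frame of the segment.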
Figure 10.3 shows an intuitive example of our structured deep model in which the
input video is segmented into three sections corresponding to the three subnetworks
in our deep architecture. Thus, the configuration of the CNNs is dynamically adjusted
in addition to searching for the appropriate latent variables of the input videos. Given
the parameters of the CNNs ω and the input video xi with its latent variables h i , the
generated feature of xi can be represented as φ(xi ; ω, h i ).
A large amount of training data is crucial for the success of many deep learning
models. Given sufficient training data, the effectiveness of applying the softmax
classifier to CNNs has been validated for image classification [20]. However, for 3D
human activity recognition, the available training data are usually less than expected.
For example, the CAD-120 dataset [21] consists of only 120 RGB-D sequences of 10
categories. In this scenario, although parameter pretraining and dropout are available,
the model training often suffers from overfitting. Hence, we consider introducing a more effective classifier together with a regularizer to improve the generalization performance of the deep model.
In supervised learning, the support vector machine (SVM), also known as the max-
margin classifier, is theoretically sound and generally can achieve promising perfor-
mance compared with the alternative linear classifiers. In deep learning research, the
combination of SVM and CNNs has been exploited [22] and has obtained excellent
results in object detection [23]. Motivated by these approaches, we impose a max-
margin classifier (w, b) upon the feature generated by the spatiotemporal CNNs for
human activity recognition.
As a max-margin classifier, the standard SVM adopts \|w\|^2, the reciprocal of the squared margin γ^2, as the regularizer. However, the generalization error bound of SVM depends on the radius-margin ratio R^2/γ^2, where R is the radius of the minimum enclosing ball (MEB) of the training data [24]. When the feature space is fixed, the radius R is constant and can therefore be ignored. However, in our approach, the
radius R is determined by the MEB of the training data in the feature space generated
by the CNNs. In this scenario, there is a risk that the margin can be increased by
simply expanding the MEB of the training data in the feature space. For example,
simply multiplying a constant to the feature vector can enlarge the margin between
the positive and negative samples, but obviously, this approach will not enable better
classification. To overcome this problem, we incorporate the radius-margin bound
into the feature learning, as Fig. 10.4 illustrates. In particular, we impose a max-
margin classifier with radius information upon the feature generated by the fully
connected layer of the spatiotemporal CNNs. The optimization tends to maximize
the margin while shrinking the MEB of the training data in the feature space, and we
thus obtain a tighter error bound.
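The scaling argument above can be checked numerically. In this small sketch (the toy data and the use of half the maximum pairwise distance as a proxy for the MEB radius are our own illustration), rescaling the feature space by a constant c inflates the margin and the radius by the same factor, so the hinge losses and the radius-margin ratio are unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(10, 2))
neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(10, 2))
X = np.vstack([pos, neg])                       # toy features phi(x_i)
y = np.array([1.0] * 10 + [-1.0] * 10)
w, b = np.array([1.0, 1.0]), 0.0                # a separating hyperplane

def radius(X):
    # half the maximum pairwise distance: a simple proxy for the MEB radius
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return 0.5 * d.max()

for c in (1.0, 10.0):
    Xc, wc, bc = c * X, w / c, b / c            # rescale the feature space by c
    margin = 1.0 / np.linalg.norm(wc)           # the geometric margin grows with c...
    hinge = np.maximum(0.0, 1.0 - (Xc @ wc + bc) * y)  # ...but the losses do not change
    print(c, margin, radius(Xc), (radius(Xc) / margin) ** 2)
```

The printed radius-margin ratio is identical for c = 1 and c = 10, which is exactly why the margin alone is not a meaningful objective once the feature space itself is learned.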
Suppose there is a set of N training samples (X, Y) = {(x_1, y_1), ..., (x_N, y_N)}, where x_i is the video, y_i ∈ {1, ..., C} represents the category label, and C is the
number of activity categories. We extract the feature for each xi by the spatiotemporal
CNNs, φ(xi ; ω, h i ), where h i refers to the latent variables. By adopting the squared
hinge loss and the radius-margin bound, we define the following loss function L 0 of
our model:
Fig. 10.4 Illustration of our deep model with the radius-margin bound. To improve the generaliza-
tion performance for classification, we propose integrating the radius-margin bound as a regularizer
with feature learning. Intuitively, as well as optimizing the max-margin parameters (w, b), we shrink
the radius R of the minimum enclosing ball (MEB) of the training data that are distributed in the
feature space generated by the CNNs. The resulting classifier with the regularizer shows better
generalization performance than the traditional softmax output
L_0 = \underbrace{\tfrac{1}{2}\|w\|^2 R_\phi^2}_{\text{radius-margin ratio}} + \lambda \sum_{i=1}^{N} \max\big(0,\ 1 - (w^T \phi(x_i; \omega, h_i) + b)\, y_i\big)^2,  (10.1)
where λ is the trade-off parameter, 1/\|w\| denotes the margin of the separating hyperplane, b denotes the bias, and R_φ denotes the radius of the MEB of the training data φ(X, ω, H) = {φ(x_1; ω, h_1), ..., φ(x_N; ω, h_N)} in the CNN feature space. Formally, the radius R_φ is defined as [24, 25]

R_\phi = \min_{\phi_0} \max_i \|\phi(x_i; \omega, h_i) - \phi_0\|.  (10.2)
The radius Rφ is implicitly defined by both the training data and the model param-
eters, meaning (i) the model in Eq. (10.1) is highly nonconvex, (ii) the derivative of
Rφ with respect to ω is hard to compute, and (iii) the problem is difficult to solve using
the stochastic gradient descent (SGD) method. Motivated by the radius-margin-based
SVM [26, 27], we investigate using the relaxed form to replace the original definition
of R_φ in Eq. (10.2). In particular, we introduce R̃_φ, defined through the maximum pairwise distance over all the training samples in the feature space, as

\tilde R_\phi = \frac{1}{2} \max_{i,j} \|\phi(x_i; \omega, h_i) - \phi(x_j; \omega, h_j)\|.  (10.3)

Do and Kalousis [26] proved that R_φ can be well bounded by R̃_φ with Lemma 2.

Lemma 2  \tilde R_\phi \le R_\phi \le \frac{1+\sqrt{3}}{2}\, \tilde R_\phi.
The abovementioned lemma guarantees that the true radius R_φ can be well approximated by R̃_φ. With a proper parameter η, the optimal solution for minimizing the radius-margin ratio \|w\|^2 R_\phi^2 is the same as that for minimizing the radius-margin sum \|w\|^2 + η R_\phi^2 [26]. Thus, by approximating R_\phi^2 with \tilde R_\phi^2 and replacing the radius-margin ratio with the radius-margin sum, we suggest the following deep model with the relaxed radius-margin bound:
L_1 = \frac{1}{2}\|w\|^2 + \eta \max_{i,j} \|\phi(x_i; \omega, h_i) - \phi(x_j; \omega, h_j)\|^2 + \lambda \sum_{i=1}^{N} \max\big(0,\ 1 - (w^T \phi(x_i; \omega, h_i) + b)\, y_i\big)^2.  (10.4)
However, the first max operator in Eq. (10.4) is defined over all training sample
pairs, and the minibatch-based SGD optimization method is, therefore, unsuitable.
Moreover, the radius in Eq. (10.4) is determined by the maximum distances of the
sample pairs in the CNN feature space, and it might be sensitive to outliers. To address
these issues, we approximate the max operator with a softmax function, resulting in
the following model:
L_2 = \frac{1}{2}\|w\|^2 + \eta \sum_{i,j} \kappa_{ij}\, \|\phi(x_i; \omega, h_i) - \phi(x_j; \omega, h_j)\|^2 + \lambda \sum_{i=1}^{N} \max\big(0,\ 1 - (w^T \phi(x_i; \omega, h_i) + b)\, y_i\big)^2  (10.5)

with

\kappa_{ij} = \frac{\exp\big(\alpha \|\phi(x_i; \omega, h_i) - \phi(x_j; \omega, h_j)\|^2\big)}{\sum_{i,j} \exp\big(\alpha \|\phi(x_i; \omega, h_i) - \phi(x_j; \omega, h_j)\|^2\big)},  (10.6)
where α controls the sharpness of the softmax approximation (as α → ∞, Eq. (10.5) approaches Eq. (10.4)). In the special case α = 0, the weights κ_{ij} become uniform, and (with the constant factor absorbed into η) the radius term can be rewritten with the feature center φ̄_ω, yielding

L_3 = \frac{1}{2}\|w\|^2 + 2\eta \sum_{i=1}^{N} \|\phi(x_i; \omega, h_i) - \bar\phi_\omega\|^2 + \lambda \sum_{i=1}^{N} \max\big(0,\ 1 - (w^T \phi(x_i; \omega, h_i) + b)\, y_i\big)^2  (10.7)
with

\bar\phi_\omega = \frac{1}{N} \sum_i \phi(x_i; \omega, h_i).  (10.8)
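To make the relaxation concrete, the short sketch below (with made-up features standing in for φ(x_i; ω, h_i)) computes the softmax weights of Eq. (10.6) and checks two properties: the weights form a distribution, and at α = 0 the weighted radius term reduces to the mean squared deviation from the feature center φ̄_ω, the quantity appearing in Eq. (10.7):

```python
import numpy as np

def kappa(features, alpha):
    """Softmax weights kappa_ij over squared pairwise feature distances (Eq. 10.6)."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    e = np.exp(alpha * d2)
    return e / e.sum(), d2

rng = np.random.default_rng(1)
phi = rng.normal(size=(8, 5))                   # toy features phi(x_i; omega, h_i)
k, d2 = kappa(phi, alpha=2.0)
print(k.sum())                                  # the weights sum to one

# alpha = 0: uniform weights; the radius term becomes (2/N) * sum_i ||phi_i - mean||^2
k0, _ = kappa(phi, alpha=0.0)
relaxed = (k0 * d2).sum()
centered = 2.0 / len(phi) * ((phi - phi.mean(0)) ** 2).sum()
print(relaxed, centered)                        # the two quantities coincide
```

As α grows, the weights concentrate on the most distant pair, recovering the max operator of Eq. (10.4).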
The optimization objectives in Eqs. (10.5) and (10.7) are two relaxed losses of our
deep model with the strict radius-margin bound in Eq. (10.1). The derivatives of the
relaxed losses with respect to ω are easy to compute, and the models can be readily
solved via SGD, which will be discussed in detail in Sect. 10.4.
10.3 Implementation
In this section, we first explain the implementation that makes our model adaptive
to an alterable temporal structure and then describe the detailed setting of our deep
architecture.
During our learning and inference procedures, we search for the appropriate latent
variables that determine the temporal decomposition of the input video (i.e., the
decomposition of activities). There are two parameters relating to the latent vari-
ables in our model: the number M of video segments and the temporal length m of
each segment. Note that the subactivities decomposed by our model have no precise
definition in a complex activity, i.e., actions can be ambiguous depending on the
temporal scale being considered.
To incorporate the latent temporal structure, we associate the latent variables with
the neurons (i.e., convolutional responses) in the bottom layer of the spatiotemporal
CNNs.
The choice of the number of segments M is important for the performance of
3D human activity recognition. The model with a small M could be less expressive
in addressing temporal variations, while a large M could lead to overfitting due to
high complexity. Furthermore, when M = 1, the model latent structure is disabled,
and our architecture degenerates to the conventional 3D-CNNs [13]. By referring
to the setting of the number of parts for the deformable part-based model [15] in
object detection, the value M can be set by cross-validation on a small set. In all our
experiments, we set M = 4.
Considering that the number of frames of the input videos is diverse, we develop
a process to normalize the inputs by two-step sampling in the learning and inference
procedure. First, we sample 30 anchor frames uniformly from the input video. Based
on these anchor frames, we search for all possible nonoverlapping temporal segmen-
tations, and the anchor frame segmentation corresponds to the segmentation of the
input video. Then, from each video segment (indicating a subactivity), we uniformly
sample m frames to feed the neural networks, and in our experiments, we set m = 9.
In addition, we reject the possible segmentations that cannot offer m frames for any
video segment.
For an input video, the number of possible temporal structures (i.e., the number of admissible anchor-frame segmentations) is 115 in our experiments.
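The enumeration of anchor-frame segmentations can be sketched as below. This is an illustrative count with a hypothetical minimum segment length, not a reproduction of the 115 admissible segmentations above, which depend on the dataset-specific rejection rule:

```python
from itertools import combinations

def segmentations(n_anchors, M, min_len):
    """All ways to split n_anchors consecutive anchor frames into M
    nonoverlapping contiguous segments, each at least min_len anchors long.
    Each segmentation is returned as a tuple of M segment lengths."""
    out = []
    for cuts in combinations(range(1, n_anchors), M - 1):
        bounds = (0,) + cuts + (n_anchors,)
        lengths = tuple(b - a for a, b in zip(bounds, bounds[1:]))
        if all(l >= min_len for l in lengths):
            out.append(lengths)
    return out

# Small illustration: 10 anchors, M = 3 segments, each at least 2 anchors long.
print(len(segmentations(10, 3, 2)))  # -> 15
```

Because the cut points are enumerated exhaustively, the same routine can be reused during inference to sweep all candidate values of the latent variables.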
Fig. 10.5 Illustration of the 3D convolutions across both spatial and temporal domains. In this
example, the temporal dimension of the 3D kernel is 3, and each feature map is thus obtained by
performing 3D convolutions across 3 adjacent frames
v_{xys} = \tanh\Big(b + \sum_{i=0}^{w'-1} \sum_{j=0}^{h'-1} \sum_{k=0}^{m'-1} \omega_{ijk} \cdot p_{(x+i)(y+j)(s+k)}\Big),  (10.9)

where p_{(x+i)(y+j)(s+k)} denotes the pixel value of the input video p at position (x + i, y + j) in the (s + k)th frame, ω_{ijk} denotes the value of the w' × h' × m' convolutional kernel ω at position (i, j, k), b stands for the bias, and tanh denotes the hyperbolic tangent function. Thus, given an input p of size w × h × m and a kernel ω, m − m' + 1 feature maps can be obtained, each with a size of (w − w' + 1, h − h' + 1).
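A naive, illustrative implementation of the 3D convolution in Eq. (10.9) (the dimension ordering and names below are our own) shows how these output sizes arise:

```python
import numpy as np

def conv3d_tanh(p, kernel, b):
    """Naive 3D convolution with tanh activation, following Eq. (10.9).
    p: input volume of shape (m, h, w); kernel: shape (m_, h_, w_); b: scalar bias."""
    m, h, w = p.shape
    m_, h_, w_ = kernel.shape
    out = np.empty((m - m_ + 1, h - h_ + 1, w - w_ + 1))
    for s in range(out.shape[0]):
        for x in range(out.shape[1]):
            for y in range(out.shape[2]):
                window = p[s:s + m_, x:x + h_, y:y + w_]
                out[s, x, y] = np.tanh(b + (kernel * window).sum())
    return out

video = np.random.default_rng(2).normal(size=(9, 80, 60))  # m = 9 frames of 80 x 60
kernel = np.random.default_rng(3).normal(size=(3, 7, 9))   # a 9 x 7 x 3 (w' x h' x m') kernel
print(conv3d_tanh(video, kernel, 0.0).shape)               # -> (7, 74, 52)
```

The output shape (m − m' + 1, h − h' + 1, w − w' + 1) = (7, 74, 52) matches the sizes stated in the text; real implementations replace the triple loop with optimized convolution routines.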
Based on the 3D convolutional operation, a 3D convolutional layer is designed for spatiotemporal feature extraction.
In our implementation, the input frame is scaled with height h = 80 and width w = 60. In the first 3D convolutional layer, the number of 3D convolutional kernels is c_1 = 7, and the kernel size is w' × h' × m' = 9 × 7 × 3. In the second layer, the number of 3D convolutional kernels is c_2 = 5, and the kernel size is w' × h' × m' = 7 × 7 × 3. Thus, we have 7 sets of feature maps after the first 3D convolutional layer and obtain 7 × 5 sets of feature maps after the second 3D convolutional layer.
Max-pooling Operator. After each 3D convolution, the max-pooling operation is
introduced to enhance the deformation and shift invariance [20]. Given a feature map
with a size of a1 × a2 , a d1 × d2 max-pooling operator is performed by taking the
maximum of every nonoverlapping d1 × d2 subregion of the feature map, resulting in
an a_1/d_1 × a_2/d_2 pooled feature map. In our implementation, a 3 × 3 max-pooling operator was applied after every 3D convolutional layer.

2D Convolutional Layer. After two layers of 3D convolution and max-pooling, 2D convolutions are performed on the resulting feature maps as

v_{xy} = \tanh\Big(b + \sum_{i=0}^{w'-1} \sum_{j=0}^{h'-1} \omega_{ij} \cdot p_{(x+i)(y+j)}\Big),  (10.10)

where p_{(x+i)(y+j)} denotes the pixel value of the feature map p at position (x + i, y + j), ω_{ij} denotes the value of the convolutional kernel ω at position (i, j), and b denotes the bias. In the 2D convolutional layer, if the number of 2D convolutional kernels is c_3, then c_1 × c_2 × c_3 sets of new feature maps are obtained by performing 2D convolutions on the c_1 × c_2 sets of feature maps generated by the second 3D convolutional layer.
In our implementation, the number of 2D convolutional kernels is set as c3 = 4
with a kernel size of 6 × 4. Hence, for each subnetwork, we can obtain 700 feature
maps with a size of 1 × 1.
Fully Connected Layer. There is only one fully connected layer with 64 neu-
rons in our architecture. All these neurons connect to a vector of 700 × 4 = 2800
dimensions, which is generated by concatenating the feature maps from all the sub-
networks. Because the training data are insufficient, and a large number of param-
eters (i.e., 179200) exist in this fully connected layer, we adopt the commonly used
dropout trick with a rate of 0.6 to prevent overfitting. The margin-based classifier is defined based on the output of the fully connected layer, where we adopt the squared hinge loss, and the activity category is predicted as

y^* = \arg\max_{i \in \{1, ..., C\}} \big(w_i^T z + b_i\big),  (10.11)

where z is the 64-dimensional vector from the fully connected layer, and {w_i, b_i} denotes the weight and bias connected to the ith activity category.
The proposed deep structured model involves three components to be optimized: (i)
the latent variables H that manipulate the activity decomposition, (ii) the margin-
based classifier {w, b}, and (iii) the CNN parameters ω. The latent variables are not
continuous and need to be estimated adaptively for different input videos, making
the standard backpropagation algorithm [8] unsuitable for our deep model. In this
section, we present a joint component learning algorithm that iteratively optimizes
the three components. Moreover, to overcome the problem of insufficient 3D data, we pretrain the model parameters on 2D videos.
Let (X, Y) = {(x_1, y_1), ..., (x_N, y_N)} denote the training set with N examples, where x_i is the video and y_i ∈ {1, ..., C} is the activity category. Denote
H = {h 1 , ..., h N } as the set of latent variables for all training examples. The model
parameters to be optimized can be divided into three groups, i.e., H , {w, b}, and ω.
Fortunately, given any two groups of parameters, the other group of parameters can be
efficiently learned using either the stochastic gradient descent (SGD) algorithm (e.g.,
for {w, b} and ω) or enumeration (e.g., for H ). Thus, we conduct the joint component
learning algorithm by iteratively updating the three groups of parameters with three
steps: (i) Given the model parameters {w, b} and ω, we estimate the latent variables
h i for each video and update the corresponding feature φ(xi ; ω, h i ) (Fig. 10.6a);
Fig. 10.6 Illustration of our joint component learning algorithm, which iterates over three steps: a given the classification parameters {w, b} and the CNN parameters ω, we estimate the latent variables h_i for each video and generate the corresponding feature φ(x_i; ω, h_i); b given the updated features φ(X; ω, H) for all training examples, the classifier {w, b} is updated via SGD; and c given {w, b} and H, backpropagation updates the CNN parameters ω
10.4 Learning Algorithm
(ii) given the updated features φ(X; ω, H), we adopt SGD to update the max-margin classifier {w, b} (Fig. 10.6b); and (iii) given the model parameters {w, b} and H, we employ SGD to update the CNN parameters ω, which leads to both an increase in the margin and a decrease in the radius (Fig. 10.6c). It is worth mentioning that steps (ii) and (iii) can be performed in the same SGD procedure; i.e., their parameters are jointly updated.
Below, we explain in detail the three steps for minimizing the losses in Eqs. (10.5)
and (10.7), which are derived from our deep model.
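The three-step loop can also be sketched end to end on toy data. In the sketch below, φ is a stand-in linear feature map for the CNN, each sample offers three candidate latent decompositions, and all names and hyperparameters are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 40, 5
X = rng.normal(size=(N, 3, d))                 # 3 candidate decompositions per sample
y = np.where(rng.random(N) < 0.5, 1.0, -1.0)
X[y > 0, :, 0] += 2.0                          # make the two classes separable

def phi(x, omega, h):                          # stand-in for the CNN feature phi(x; omega, h)
    return omega @ x[h]

omega, w, b = np.eye(d), np.zeros(d), 0.0
lam, lr = 5.0, 0.02

def hinge(i, h):
    return max(0.0, 1.0 - (w @ phi(X[i], omega, h) + b) * y[i])

def loss(H):
    return 0.5 * w @ w + lam * np.mean([hinge(i, H[i]) ** 2 for i in range(N)])

H = [0] * N
start = loss(H)
for it in range(50):
    # step (i): exhaustive search for the best latent variable of each sample
    H = [min(range(3), key=lambda h: hinge(i, h)) for i in range(N)]
    # steps (ii)+(iii): one joint gradient step on {w, b} and omega (squared hinge)
    gw, gb, gom = np.copy(w), 0.0, np.zeros((d, d))
    for i in range(N):
        f, l = phi(X[i], omega, H[i]), hinge(i, H[i])
        gw += -2.0 * lam * l * y[i] * f / N
        gb += -2.0 * lam * l * y[i] / N
        gom += -2.0 * lam * l * y[i] * np.outer(w, X[i, H[i]]) / N
    w, b, omega = w - lr * gw, b - lr * gb, omega - lr * gom

print(start, loss(H))                          # the joint loss decreases
```

The sketch omits the radius term for brevity; in the full model, the radius regularizer prevents the trivial shortcut of inflating the feature map ω to enlarge the margin.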
(i) Given the model parameters ω and {w, b}, for each sample (x_i, y_i), the most appropriate latent variables h_i can be determined by an exhaustive search over all possible choices,

h_i^* = \arg\min_{h_i} \big(1 - (w^T \phi(x_i; \omega, h_i) + b)\, y_i\big).  (10.12)
GPU programming is employed to accelerate the search process. With the updated
latent variables, we further obtain the feature set φ(X ; ω, H ) of all the training data.
(ii) Given φ(X; ω, H) and the CNN parameters ω, batch stochastic gradient descent (SGD) is adopted to update the model parameters in Eqs. (10.5) and (10.7). In iteration t, a batch B_t ⊂ (X, Y, H) of k samples is chosen. The gradients of the max-margin classifier with respect to the parameters {w, b} are

\frac{\partial L}{\partial w} = w - 2\lambda \sum_{(x_i, y_i, h_i) \in B_t} y_i\, \phi(x_i; \omega, h_i)\, \max\big(0,\ 1 - (w^T \phi(x_i; \omega, h_i) + b)\, y_i\big),  (10.13)

\frac{\partial L}{\partial b} = -2\lambda \sum_{(x_i, y_i, h_i) \in B_t} y_i\, \max\big(0,\ 1 - (w^T \phi(x_i; \omega, h_i) + b)\, y_i\big).  (10.14)
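The gradients in Eqs. (10.13) and (10.14) can be verified numerically against finite differences; the random feature matrix below is a stand-in for φ(x_i; ω, h_i):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 6, 4
F = rng.normal(size=(N, d))                 # stand-ins for phi(x_i; omega, h_i)
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
w, b, lam = 0.1 * rng.normal(size=d), 0.1, 0.5

def L(w, b):                                # the {w, b}-dependent part of the loss
    l = np.maximum(0.0, 1.0 - (F @ w + b) * y)
    return 0.5 * w @ w + lam * (l ** 2).sum()

l = np.maximum(0.0, 1.0 - (F @ w + b) * y)
gw = w - 2.0 * lam * ((l * y)[:, None] * F).sum(0)   # analytic gradient, Eq. (10.13)
gb = -2.0 * lam * (l * y).sum()                      # analytic gradient, Eq. (10.14)

eps = 1e-6                                  # central finite differences
num_gb = (L(w, b + eps) - L(w, b - eps)) / (2 * eps)
num_gw = np.array([(L(w + eps * e, b) - L(w - eps * e, b)) / (2 * eps)
                   for e in np.eye(d)])
print(np.abs(gw - num_gw).max(), abs(gb - num_gb))   # both discrepancies are tiny
```

Because the squared hinge is differentiable (away from the measure-zero kink), the analytic and numerical gradients agree closely, which is what makes plain SGD applicable here.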
To express the gradient of the radius term in Eq. (10.5) compactly, we further define

\kappa_i = \sum_j \kappa_{ij},  (10.15)

\phi_i = \sum_j \kappa_{ij}\, \phi(x_j; \omega, h_j).  (10.16)
With κ_i and φ_i, based on batch SGD, the derivative of L_2 with respect to the spatiotemporal CNN parameters is

\frac{\partial L_2}{\partial \omega} = 4\eta \sum_{(x_i, y_i, h_i) \in B_t} \big(\kappa_i\, \phi(x_i; \omega, h_i) - \phi_i\big)^T \frac{\partial \phi(x_i; \omega, h_i)}{\partial \omega} - 2\lambda \sum_{(x_i, y_i, h_i) \in B_t} y_i\, w^T \frac{\partial \phi(x_i; \omega, h_i)}{\partial \omega} \max\big(0,\ 1 - (w^T \phi(x_i; \omega, h_i) + b)\, y_i\big).  (10.17)
When α = 0, we first update the mean φ̄ω in Eq. (10.8) based on φ(X ; ω, H ) and
then compute the derivative of the relaxed loss in Eq. (10.7) as
\frac{\partial L_3}{\partial \omega} = 4\eta \sum_{(x_i, y_i, h_i) \in B_t} \big(\phi(x_i; \omega, h_i) - \bar\phi_\omega\big)^T \frac{\partial \phi(x_i; \omega, h_i)}{\partial \omega} - 2\lambda \sum_{(x_i, y_i, h_i) \in B_t} y_i\, w^T \frac{\partial \phi(x_i; \omega, h_i)}{\partial \omega} \max\big(0,\ 1 - (w^T \phi(x_i; \omega, h_i) + b)\, y_i\big).  (10.18)
10.4.3 Inference
Given an input video x_i, the inference task aims to recognize the category of the activity, which can be formulated as the maximization of F_y(x_i, ω, h) with respect to the activity label y and the latent variables h,

(y^*, h^*) = \arg\max_{y, h} F_y(x_i, \omega, h) = \arg\max_{y, h} \big(w_y^T \phi(x_i; \omega, h) + b_y\big),  (10.19)

where {w_y, b_y} denotes the parameters of the max-margin classifier for the activity category y. Note that the possible values for y and h are discrete. Thus, the problem above can be solved by searching across all the labels y (1 ≤ y ≤ C) and calculating the maximum of F_y(x_i, ω, h) by optimizing h. To find the maximum of F_y(x_i, ω, h), we enumerate all possible values of h and calculate the corresponding F_y(x_i, ω, h) via a forward pass of the networks.
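The exhaustive inference over (y, h) amounts to a small argmax. In this sketch, the candidate feature vectors stand in for φ(x_i; ω, h) under each possible h, and W and bvec hold hypothetical per-category classifiers {w_y, b_y}:

```python
import numpy as np

rng = np.random.default_rng(5)
C, d, n_h = 4, 6, 5                  # categories, feature dim, candidate values of h
W = rng.normal(size=(C, d))          # rows: per-category weights w_y
bvec = rng.normal(size=C)            # per-category biases b_y
feats = rng.normal(size=(n_h, d))    # phi(x_i; omega, h) for each candidate h

# Enumerate all (y, h) pairs and keep the maximizer of F_y = w_y^T phi + b_y.
scores = W @ feats.T + bvec[:, None]             # score table of shape (C, n_h)
y_star, h_star = np.unravel_index(np.argmax(scores), scores.shape)
print(int(y_star), int(h_star))
```

Since y and h are both discrete and the score table is small (C categories times at most a few hundred segmentations), this brute-force search is cheap, and the feature extraction dominates the cost.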
10.5 Experiments
To validate the advantages of our model, experiments are conducted on several chal-
lenging public datasets, i.e., the CAD-120 dataset [21], the SBU Kinect Interaction
dataset [29], and a larger dataset newly created by us, namely, the Office Activity
(OA) dataset. Moreover, we introduce a more comprehensive dataset in our experi-
ments by combining five existing datasets of RGB-D human activity. In addition to
demonstrating the superior performance of the proposed model compared to other
state-of-the-art methods, we extensively evaluate the main components of our frame-
work.
The CAD-120 dataset comprises 120 RGB-D video sequences of humans performing
long daily activities in 10 categories and has been widely used to test 3D human
activity recognition methods. These activities recorded via the Microsoft Kinect
sensor were performed by four different subjects, and each activity was repeated
three times by the same actor. Each activity has a long sequence of subactivities,
which vary significantly from subject to subject in terms of length, order, and the
way the task is executed. The challenges of this dataset also lie in the large variance
in object appearance, human pose, and viewpoint. Several sampled frames and depth
maps from these 10 categories are exhibited in Fig. 10.7a.
The SBU dataset consists of 8 categories of two-person interaction activities,
including a total of approximately 300 RGB-D video sequences, i.e., approximately
40 sequences for each interaction category. Although most interactions in this dataset
are simple, it is still challenging to model two-person interactions by considering the
following difficulties: (i) one person is acting, and the other person is reacting in
most cases; (ii) the average frame length of these interactions is short (ranging from 20 to 40 frames); and (iii) the depth maps are noisy. Figure 10.7b shows several sampled
frames and depth maps of these 8 categories.
The proposed OA dataset is more comprehensive and challenging than the existing
datasets, and it covers regular daily activities that take place in an office. To the best
of our knowledge, it is the largest activity dataset of RGB-D videos, consisting of
1180 sequences. The OA database is publicly accessible.1 Three RGB-D sensors (i.e.,
Microsoft Kinect cameras) are utilized to capture data from different viewpoints, and
1 http://vision.sysu.edu.cn/projects/3d-activity/.
Fig. 10.7 Activity examples from the testing databases. Several sampled frames and depth maps
are presented. a CAD-120, b SBU, c OA1, and d OA2 show two activities of the same category
selected from the three databases
more than 10 actors are involved. The activities are captured in two different offices
to increase variability, and each actor performs the same activity twice. Activities
performed by two subjects who interact are also included. Specifically, the dataset
is divided into two subsets, each of which contains 10 categories of activities: OA1
(complex activities by a single subject) and OA2 (complex interactions by two sub-
jects). Several sampled frames and depth maps from OA1 and OA2 are shown in
Fig. 10.7c, d, respectively.
Empirical analysis is used to assess the main components of the proposed deep struc-
tured model, including the latent structure, relaxed radius-margin bound, model pre-
training, and depth/grayscale channel. Several variants of our method are suggested
by enabling/disabling certain components. Specifically, we denote the conventional
3D convolutional neural network with the softmax classifier as Softmax + CNN, the
3D CNN with the SVM classifier as SVM + CNN, and the 3D CNN with the relaxed
radius-margin bound classifier as R-SVM + CNN. Analogously, we refer to our deep
Fig. 10.8 Test error rates of the simplified model (R-SVM + CNN) and the full model (R-SVM + LCNN) under increasing training iterations (0–300) on the CAD120 dataset
model as LCNN and then define Softmax + LCNN, SVM + LCNN, and R-SVM +
LCNN accordingly.
Latent Model Structure. In this experiment, we implement a simplified version
of our model by removing the latent structure and comparing it with our full model.
The simplified model is actually a spatiotemporal CNN model with both 3D and
2D convolutional layers, and this model uniformly segments the input video into M
subactivities. Without the latent variables to be estimated, the standard backpropa-
gation algorithm is employed for model training. We execute this experiment on the
CAD120 dataset. Figure 10.8 shows the test error rates with different iterations of the
simplified model (i.e., R-SVM + CNN) and the full version (i.e., R-SVM + LCNN).
Based on the results, we observe that our full model outperforms the simplified
model in both error rate and training efficiency. Furthermore, the structured models
with model pretraining, i.e., Softmax + LCNN, SVM + LCNN, R-SVM + LCNN,
achieve 14.4%/11.1%/12.4% better performance than the traditional CNN models,
i.e., Softmax + CNN, SVM + CNN, R-SVM + CNN, respectively. The results clearly
demonstrate the significance of incorporating the latent temporal structure to address
the large temporal variations in human activities.
Pretraining. To justify the effectiveness of pretraining, we discard the parame-
ters trained on the 2D videos and learn the model directly on the grayscale-depth
data. Then, we compare the test error rate of the models with/without pretraining.
To analyze the rate of convergence, we adopt the R-SVM + LCNN framework and use the same learning rate settings with and without pretraining for a fair comparison. Using the CAD120 dataset, we plot the test error rates with increasing
iteration numbers during training in Fig. 10.9. The model using pretraining converges
in 170 iterations, while the model without pretraining requires 300 iterations, and the
model with pretraining converges to a much lower test error rate (9%) than that with-
out pretraining (25%). Furthermore, we also compare the performance with/without
pretraining using SVM + LCNN and R-SVM + LCNN. We find that pretraining is
Fig. 10.9 Test error rates with and without pretraining under increasing training iterations (0–300) on the CAD120 dataset
Fig. 10.10 Confusion matrices of our proposed deep structured model on the a CAD120, b SBU,
c OA1, and d OA2 datasets. It is evident that these confusion matrices all have a strong diagonal
with few errors
effective in reducing the test error rate. Actually, the test error rate with pretraining
is approximately 15% less than that without pretraining (Fig. 10.9).
Relaxed Radius-margin Bound. As described above, the training data for
grayscale-depth human activity recognition are scarce. Thus, for the last fully con-
nected layer, we adopt the SVM classifier by incorporating the relaxed radius-margin
bound, resulting in the R-SVM + LCNN model. To justify the role of the relaxed
radius-margin bound, Table 10.1 compares the accuracy of Softmax + LCNN, SVM
+ LCNN, and R-SVM + LCNN on all datasets with the same experimental settings.
Table 10.1 Average accuracy of all categories on four datasets with different classifiers
Softmax + LCNN (%) SVM + LCNN (%) R-SVM + LCNN (%)
CAD120 82.7 89.4 90.1
SBU 92.4 92.8 94.0
OA1 60.7 68.5 69.3
OA2 47.0 53.7 54.5
Merged_50 30.3 36.4 37.3
Merged_4 87.1 88.5 88.9
Table 10.2 Channel analysis of the three datasets. Average accuracy of all categories is reported
Grayscale (%) Depth (%) Grayscale + depth (%)
OA1 60.4 65.2 69.3
OA2 46.3 51.1 54.5
Merged_50 27.8 33.4 37.3
Merged_4 81.7 85.5 88.9
References
1. L. Lin, K. Wang, W. Zuo, M. Wang, J. Luo, L. Zhang, A deep structured model with radius-
margin bound for 3D human activity recognition. Int. J. Comput. Vis. 118(2), 256–273 (2016)
2. L. Xia, C. Chen, J.K. Aggarwal, View invariant human action recognition using histograms of
3d joints, in CVPRW, pp 20–27 (2012)
3. O. Oreifej, Z. Liu, Hon4d: Histogram of oriented 4d normals for activity recognition from
depth sequences, in CVPR, pp. 716–723 (2013)
4. L. Xia, J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity recognition
using depth camera, in CVPR, pp. 2834–2841 (2013)
5. J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, in CVPR, pp. 1290–1297 (2012)
6. Y. Wang, G. Mori, Hidden part models for human action recognition: Probabilistic vs. max-
margin. IEEE Trans. Pattern Anal. Mach. Intell. 33(7), 1310–1323 (2011)
7. J.M. Chaquet, E.J. Carmona, A. Fernandez-Caballero, A survey of video datasets for human
action and activity recognition. Comput. Vis. Image Underst. 117(6), 633–659 (2013)
8. Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, L.D. Jackel, Handwritten digit recognition with a back-propagation network, in Advances in Neural Information Processing Systems (1990)
9. G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks.
Science 313(5786), 504–507 (2006)
10. P. Wu, S. Hoi, H. Xia, P. Zhao, D. Wang, C. Miao, Online multimodal deep similarity learning
with application to image retrieval, in ACM Mutilmedia, pp. 153–162 (2013)
11. P. Luo, X. Wang, X. Tang, Pedestrian parsing via deep decompositional neural network, in
ICCV, pp. 2648–2655 (2013)
12. K. Wang, X. Wang, L. Lin, 3d human activity recognition with reconfigurable convolutional
neural networks, in ACM MM (2014)
13. S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition.
IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
14. S. Zhu, D. Mumford, A stochastic grammar of images. Found. Trends Comput. Graph. Vis.
2(4), 259–362 (2007)
15. P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discrim-
inatively trained part based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645
(2010)
16. M.R. Amer, S. Todorovic, Sum-product networks for modeling activities with stochastic structure, in CVPR, pp. 1314–1321 (2012)
17. L. Lin, X. Wang, W. Yang, J.H. Lai, Discriminatively trained and-or graph models for object shape detection. IEEE Trans. Pattern Anal. Mach. Intell. 37(5), 959–972 (2015)
18. M. Pei, Y. Jia, S. Zhu, Parsing video events with goal inference and intent prediction, in ICCV,
pp. 487–494 (2011)
19. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video
classification with convolutional neural networks, in CVPR (2014)
20. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
21. H.S. Koppula, R. Gupta, A. Saxena, Learning human activities and object affordances from
rgb-d videos. Int. J. Robot. Res. (IJRR) 32(8), 951–970 (2013)
22. F.J. Huang, Y. LeCun, Large-scale learning with svm and convolutional for generic object
categorization, in CVPR, pp. 284–291 (2006)
23. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object
detection and semantic segmentation, in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2014)
24. V. Vapnik, Statistical Learning Theory (John Wiley and Sons, New York, 1998)
25. O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for support
vector machines. Mach. Learn. 46(1–3), 131–159 (2002)
26. H. Do, A. Kalousis, Convex formulations of radius-margin based support vector machines, in
ICML (2013)
27. H. Do, A. Kalousis, M. Hilario, Feature weighting using margin and radius based error bound optimization in SVMs, in Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol. 5781 (Springer, Berlin, Heidelberg, 2009), pp. 315–329
28. P. Sermanet, K. Kavukcuoglu, S. Chintala, Y. LeCun, Pedestrian detection with unsupervised multi-stage feature learning, in CVPR (2013)
29. K. Yun, J. Honorio, D. Chattopadhyay, T.L. Berg, D. Samaras, Two-person interaction detection using body-pose features and multiple instance learning, in CVPRW (2012)