Liang Lin · Dongyu Zhang · Ping Luo · Wangmeng Zuo

Human Centric Visual Analysis with Deep Learning
Liang Lin
School of Data and Computer Science
Sun Yat-sen University
Guangzhou, Guangdong, China

Dongyu Zhang
School of Data and Computer Science
Sun Yat-sen University
Guangzhou, Guangdong, China

Ping Luo
School of Information Engineering
The Chinese University of Hong Kong
Hong Kong, Hong Kong

Wangmeng Zuo
School of Computer Science
Harbin Institute of Technology
Harbin, China
ISBN 978-981-13-2386-7
ISBN 978-981-13-2387-4 (eBook)
https://doi.org/10.1007/978-981-13-2387-4
© Springer Nature Singapore Pte Ltd. 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Foreword

When Liang asked me to write the foreword to his new book, I was very happy and
proud to see the success that he has achieved in recent years. I have known Liang
since 2005, when he visited the Department of Statistics of UCLA as a Ph.D.
student. Very soon, I was deeply impressed by his enthusiasm and potential in
academic research during regular group meetings and his presentations. Since 2010,
Liang has been building his own laboratory at Sun Yat-sen University, which is the
best university in southern China. I visited him and his research team in the summer
of 2010 and spent a wonderful week with them. Over these years, I have witnessed
the fantastic success of him and his group, who set an extremely high standard. His
work on deep structured learning for visual understanding has built his reputation as
a well-established professor in computer vision and machine learning. Specifically,
Liang and his team have focused on improving feature representation learning with
several interpretable and context-sensitive models and applied them to many
computer vision tasks, which is also the focus of this book. On the other hand, he
has a particular interest in developing new models, algorithms, and systems for
intelligent human-centric analysis while continuing to focus on a series of classical
research tasks such as face identification, pedestrian detection in surveillance, and
human segmentation. The performance of human-centric analysis has been sig-
nificantly improved by recently emerging techniques such as very deep neural
networks, and new advances in learning and optimization. The research team led by
Liang is one of the main contributors in this direction and has received increasing
attention from both academia and industry. In sum, Liang and his colleagues did
an excellent job with the book, which is the most up-to-date resource you can find
and a great introduction to human-centric visual analysis with emerging deep
structured learning.
If you need more motivation than that, read on:
In this book, you will find a wide range of research topics in human-centric
visual analysis including both classical (e.g., face detection and alignment) and
newly rising topics (e.g., fashion clothing parsing), and a series of state-of-the-art
solutions addressing these problems. For example, a newly emerging task, human
parsing, namely, decomposing a human image into semantic fashion/body regions,

is deeply and comprehensively introduced in this book, and you will find not only
the solutions to the real challenges of this problem but also new insights from which
more general models or theories for related problems can be derived.
To the best of our knowledge, to date, a published systematic tutorial or book
targeting this subject is still lacking, and this book will fill that gap. I believe this
book will serve the research community in the following aspects:
(1) It provides an overview of the current research in human-centric visual
analysis and highlights the progress and difficulties. (2) It includes a tutorial in
advanced techniques of deep learning, e.g., several types of neural network
architectures, optimization methods, and techniques. (3) It systematically discusses
the main human-centric analysis tasks on different levels, ranging from face/human
detection and segmentation to parsing and other higher level understanding. (4) It
provides effective methods and detailed experimental analysis for every task as well
as sufficient references and extensive discussions.
Furthermore, although the substantial content of this book focuses on
human-centric visual analysis, it is also enlightening regarding the development of
detection, parsing, recognition, and high-level understanding methods for other AI
applications such as robotic perception. Additionally, some new advances in deep
learning are mentioned. For example, Liang introduces the Kalman normalization
method, which was invented by Liang and his students, for improving and accel-
erating the training of DNNs, particularly in the context of microbatches.
I believe this book will be very helpful and important to academic
professors/students as well as industrial engineers working in the field of vision
surveillance, biometrics, and human–computer interaction, where human-centric
visual analysis is indispensable in analyzing human identity, pose, attributes, and
behaviors. Briefly, this book will not only equip you with the skills to solve the
application problems but will also give you a front-row seat to the development of
artificial intelligence. Enjoy!

Alan Yuille
Bloomberg Distinguished Professor of Cognitive Science
and Computer Science
Johns Hopkins University, Baltimore, Maryland, USA
Preface

Human-centric visual analysis is regarded as one of the most fundamental problems
in computer vision, concerning the analysis of human images in a variety of application
fields. Developing solutions for comprehensive human-centric visual applications
could have crucial impacts in many industrial application domains such as virtual
reality, human–computer interaction, and advanced robotic perception. For exam-
ple, clothing virtual try-on simulation systems that seamlessly fit various clothes to
the human body shape have attracted much commercial interest. In addition, human
motion synthesis and prediction can bridge virtual and real worlds, facilitating more
intelligent robotic–human interactions by enabling causal inferences for human
activities.
Research on human-centric visual analysis is quite challenging. Nevertheless,
through the continuous efforts of academic and industrial researchers, steady
progress has been achieved in this field in recent decades. Recently, deep learning
methods have been widely applied to computer vision. The success of deep learning
methods can be partly attributed to the emergence of big data, newly proposed
network models, and optimization methods. With the development of deep learning,
considerable progress has also been achieved in different subtasks of human-centric
visual analysis. For example, in facial recognition, the accuracy of deep
model-based methods has exceeded that of humans. The most accurate face
detection methods are also based on deep learning models. This progress has
spawned many interesting and practical applications, such as face ID in smart-
phones, which can identify individual users and detect fraudulent authentication
based on faces.
In this book, we will provide an in-depth summary of recent progress in
human-centric visual analysis based on deep learning methods. The book is orga-
nized into five parts. In the first part, Chap. 1 first provides the background of deep
learning methods including a short review of the development of artificial neural
networks and the backpropagation method to give the reader a better understanding
of certain deep learning concepts. We also introduce a new technique for the
training of deep neural networks. Subsequently, in Chap. 2, we provide an overview
of the tasks and the current progress of human-centric visual analysis.


In the second part, we introduce tasks related to how to localize a person in an
image. Specifically, we focus on face detection and pedestrian detection. In Chap. 3,
we introduce the facial landmark localization method based on a cascaded fully
convolutional network. The proposed method first generates low-resolution response
maps to identify approximate landmark locations and then produces fine-grained
response maps over local regions for more accurate landmark localization. We then
introduce the attention-aware facial hallucination method, which generates a
high-resolution facial image from a low-resolution image. This method recurrently
discovers facial parts and enhances them by fully exploiting the global interde-
pendency of facial images. In Chap. 4, we introduce a deep learning model for
pedestrian detection based on region proposal networks and boosted forests.
In the third part, several representative human parsing methods are described. In
Chap. 5, we first introduce a new benchmark for the human parsing task, followed
by a self-supervised structure-sensitive learning method for human parsing. In
Chaps. 6–7, instance-level human parsing and video instance-level human parsing
methods are introduced.
In the fourth part, person verification and face verification are introduced. In
Chap. 8, we describe a cross-modal deep model for person verification. The model
accepts different input modalities and produces predictions. In Chap. 9, we introduce
a deep learning model for face recognition by exploiting unlabeled data based on
active learning.
The last part describes a high-level task and discusses the progress of human
activity recognition.
The book is based on our years of research on human-centric visual analysis.
Since 2010, with grant support from the National Natural Science Foundation of
China (NSFC), we have developed our research plan. Since then, an increasing
number of studies have been conducted in this area. We would like to express our
gratitude to our colleagues and Ph.D. students, including Prof. Xiaodan Liang, Prof.
Guanbin Li, Dr. Pengxu Wei, Dr. Keze Wang, Dr. Tianshui Chen, Dr. Qingxing
Cao, Dr. Guangrun Wang, Dr. Lingbo Liu, and Dr. Ziliang Chen, for their con-
tributions to the research achievements on this topic. It has been our great honor to
work with them on this inspiring topic in recent years.

Guangzhou, China Liang Lin


Contents

Part I Motivation and Overview


1 The Foundation and Advances of Deep Learning . . . . . . . . . . . . . . 3
1.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Multilayer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Formulation of Neural Network . . . . . . . . . . . . . . . . . . 6
1.2 New Techniques in Deep Learning . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Batch Kalman Normalization . . . . . . . . . . . . . . . . . . . 9
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Human-Centric Visual Analysis: Tasks and Progress . . . . . . . . . . . 15
2.1 Face Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Facial Landmark Localization . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Conventional Approaches . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Deep-Learning-Based Models . . . . . . . . . . . . . . . . . . . 17
2.3 Pedestrian Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Benchmarks for Pedestrian Detection . . . . . . . . . . . . . . 18
2.3.2 Pedestrian Detection Methods . . . . . . . . . . . . . . . . . . . 19
2.4 Human Segmentation and Clothes Parsing . . . . . . . . . . . . . . . . 21
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

Part II Localizing Persons in Images


3 Face Localization and Enhancement . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 Facial Landmark Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 The Cascaded BB-FCN Architecture . . . . . . . . . . . . . . . . . . . . 31
3.2.1 Backbone Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Branch Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 Ground Truth Heat Map Generation . . . . . . . . . . . . . . 34


3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35


3.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.3 Performance Evaluation for Unconstrained Settings . . . 36
3.3.4 Comparison with the State of the Art . . . . . . . . . . . . . 36
3.4 Attention-Aware Face Hallucination . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 The Framework of Attention-Aware Face
Hallucination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.2 Recurrent Policy Network . . . . . . . . . . . . . . . . . . . . . . 40
3.4.3 Local Enhancement Network . . . . . . . . . . . . . . . . . . . 42
3.4.4 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . 42
3.4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4 Pedestrian Detection with RPN and Boosted Forest . . . . . . . . . . . . 47
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.1 Region Proposal Network for Pedestrian Detection . . . . 49
4.2.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.3 Boosted Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Part III Parsing Person in Detail


5 Self-supervised Structure-Sensitive Learning for Human
Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Look into Person Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3 Self-supervised Structure-Sensitive Learning . . . . . . . . . . . . . . . 62
5.3.1 Self-supervised Structure-Sensitive Loss . . . . . . . . . . . 64
5.3.2 Experimental Result . . . . . . . . . . . . . . . . . . . . . . . . . . 66
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6 Instance-Level Human Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.3 Crowd Instance-Level Human Parsing Dataset . . . . . . . . . . . . . 73
6.3.1 Image Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3.2 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.4 Part Grouping Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.4.1 PGN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.4.2 Instance Partition Process . . . . . . . . . . . . . . . . . . . . . . 78

6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.5.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.5.2 PASCAL-Person-Part Dataset . . . . . . . . . . . . . . . . . . . 80
6.5.3 CIHP Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.5.4 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7 Video Instance-Level Human Parsing . . . . . . . . . . . . . . . . . . . . . . . 85
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.2 Video Instance-Level Parsing Dataset . . . . . . . . . . . . . . . . . . . . 86
7.2.1 Data Amount and Quality . . . . . . . . . . . . . . . . . . . . . . 87
7.2.2 Dataset Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.3 Adaptive Temporal Encoding Network . . . . . . . . . . . . . . . . . . . 87
7.3.1 Flow-Guided Feature Propagation . . . . . . . . . . . . . . . . 90
7.3.2 Parsing R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.3.3 Training and Inference . . . . . . . . . . . . . . . . . . . . . . . . 91
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

Part IV Identifying and Verifying Persons


8 Person Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.2 Generalized Similarity Measures . . . . . . . . . . . . . . . . . . . . . . . 101
8.2.1 Model Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.2.2 Connection with Existing Models . . . . . . . . . . . . . . . . 105
8.3 Joint Similarity and Feature Learning . . . . . . . . . . . . . . . . . . . . 106
8.3.1 Deep Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.3.2 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
9 Face Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
9.3 Framework Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.4 Formulation and Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 121
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

Part V Higher Level Tasks


10 Human Activity Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
10.2 Deep Structured Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
10.2.1 Spatiotemporal CNNs . . . . . . . . . . . . . . . . . . . . . . . . . 137

10.2.2 Latent Temporal Structure . . . . . . . . . . . . . . . . . . . . . . 137


10.2.3 Deep Model with Relaxed Radius-Margin Bound . . . . . 139
10.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
10.3.1 Latent Temporal Structure . . . . . . . . . . . . . . . . . . . . . . 142
10.3.2 Architecture of Deep Neural Networks . . . . . . . . . . . . 143
10.4 Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
10.4.1 Joint Component Learning . . . . . . . . . . . . . . . . . . . . . 146
10.4.2 Model Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
10.4.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
10.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
10.5.1 Datasets and Setting . . . . . . . . . . . . . . . . . . . . . . . . . . 150
10.5.2 Empirical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Part I
Motivation and Overview

Human-centric visual analysis is important in the field of computer vision. Typical
applications include face recognition, person reidentification, and pose estimation.
Developing solutions to comprehensive human-centric visual applications has bene-
fited many industrial application domains such as smart surveillance, virtual
reality, and human–computer interaction.
Most problems of human-centric visual analysis are quite challenging. For exam-
ple, person reidentification involves identifying the same person in images/videos
captured by different cameras based on features such as clothing and body motions.
However, due to the large variations in human appearance, cluttered backgrounds,
irrelevant motions, and scale and illumination changes, it is difficult to accurately reiden-
tify the same person among thousands of images. For face recognition in the wild,
illumination variation and occlusion can significantly reduce recognition accuracy.
Similar problems exist in other human-centric visual analysis tasks.
On the other hand, the past decade has witnessed the rapid development of feature
representation learning, especially deep neural networks, which greatly enhanced the
already rapidly developing field of computer vision. The emerging deep models
trained on large-scale databases have effectively improved system performance in
practical applications. The most famous example is AlphaGo, which beat the top
human Go players in 2015.
In this book, we introduce the recent development of deep neural networks
in several human-centric visual analysis problems. The book is divided into five
parts. In the first part, we briefly review the foundation of the deep neural network
and introduce some recently developed advanced new techniques. We also provide
an overview of human-centric visual analysis problems in this part. Then, from
Part II to Part V, we introduce our work on typical human-centric visual analy-
sis problems including face detection, face recognition and verification, pedestrian
detection, pedestrian recognition, human parsing, and action recognition.
Chapter 1
The Foundation and Advances of Deep
Learning

Abstract The past decade has witnessed the rapid development of feature represen-
tation learning, especially deep learning. Deep learning methods have achieved great
success in many applications, including computer vision and natural language pro-
cessing. In this chapter, we present a short review of the foundation of deep learning,
i.e., artificial neural networks, and introduce some new techniques in deep learning.

1.1 Neural Networks

Neural networks, the foundation of deep learning models, are biologically inspired
systems that are intended to simulate the way in which the human brain processes
information. The human brain consists of a large number of neurons that are highly
connected by synapses. The arrangement of neurons and the strengths of the indi-
vidual synapses, determined by a complex chemical process, establish the function
of the neural network of the human brain. Neural networks are excellent tools for
finding patterns that are far too complex or numerous for a human programmer to
extract and teach the machine to recognize.
The beginning of neural networks can be traced to the 1940s, when the single
perceptron neuron was proposed, and only over the past several decades have neural
networks become a major part of artificial intelligence. This is due to the development
of backpropagation, which allows multilayer perceptron neural networks to adjust
the weights of neurons in situations where the outcome does not match what the
creator is hoping for. In the following, we briefly review the background of neural
networks, including the perceptron, multilayer perceptron, and the backpropagation
algorithm.

1.1.1 Perceptron

The perceptron occupies a special place in the historical development of neural
networks. Because the importance of different inputs is not the same, perceptrons
introduce weights w_j to each input to account for the difference. The perceptron sums
the weighted inputs and produces a single binary output with its activation function,
f(x), which is defined as

f(x) = \begin{cases} 1, & \text{if } \sum_j w_j x_j + b > 0, \\ 0, & \text{otherwise}, \end{cases}   (1.1)

where w_j is the weight and b is the bias, which shifts the decision boundary away
from the origin.
The perceptron with one output can only be used for binary classification prob-
lems. As with most other techniques for training linear classifiers, the perceptron
naturally generalizes to multiclass classification. Here, the input x and the output y
are drawn from arbitrary sets. A feature representation function f (x, y) maps each
possible input/output pair to a finite-dimensional real-valued feature vector. The fea-
ture vector is multiplied by a weight vector w, but the resulting score is now used to
choose among many possible outputs:

\hat{y} = \arg\max_y f(x, y) \cdot w.   (1.2)

Perceptron neurons are a type of linear classifier. If the dataset is linearly separable,
then the perceptron network is guaranteed to converge. Furthermore, there is an upper
bound on the number of times that the perceptron will adjust its weights during the
training. Suppose that the input vectors from the two classes can be separated by a
hyperplane with a margin γ, and let R denote the maximum norm of an input vector;
then the number of weight updates made during training is at most (R/γ)^2.
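To make the perceptron update concrete, here is a minimal sketch in Python/NumPy (not taken from the book); the toy data, the number of epochs, and the mistake-driven update rule are illustrative assumptions.

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Train a binary perceptron on data X of shape (n, d) with 0/1 labels y."""
    w = np.zeros(X.shape[1])   # weights w_j
    b = 0.0                    # bias b
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            pred = 1 if np.dot(w, x_i) + b > 0 else 0   # activation of Eq. (1.1)
            if pred != y_i:                             # update only on mistakes
                w += (y_i - pred) * x_i
                b += (y_i - pred)
    return w, b

# Linearly separable toy data, so the perceptron is guaranteed to converge.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, 0, 0])
w, b = train_perceptron(X, y)
print(w, b)
```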

1.1.2 Multilayer Perceptron

The multilayer perceptron (MLP) is a class of feedforward artificial neural networks
consisting of at least three layers of nodes. Except for the input nodes, each node is
a neuron that uses a nonlinear activation function. Its multiple layers and nonlinear
activations distinguish the MLP from a linear perceptron. An MLP has three basic features:
(1) Each neuron in the network includes a differentiable nonlinear activation function.
(2) The network contains one or more hidden layers in addition to the input and output
layers. (3) The network exhibits a high degree of connectivity.
A typical MLP architecture is shown in Fig. 1.1. The network contains one input
layer, two hidden layers, and an output layer. The network is fully connected, which
means that a neuron in any layer of the network is connected to all the neurons in the
previous layer. The first hidden layer is fed from the input layer, and its outputs are in
turn applied to the next hidden layer, and this process is repeated for the remainder
of the MLP neural network.
Fig. 1.1 Illustration of a typical neural network

Each neuron in the MLP network includes a differentiable nonlinear activation
function. The sigmoid function is commonly used in MLP. The activation function

of sigmoid neurons is defined as

f(x) = \frac{1}{1 + \exp\{-(w \cdot x + b)\}},   (1.3)

where x is the input vector and w is the weight vector. With the sigmoid function, the
output of the neuron is no longer just the binary value 1 or 0. In general, the sigmoid
function is real-valued, monotonic, smooth, and differentiable, having a nonnegative
first derivative that is bell shaped. The smoothness of the sigmoid function means
that small changes w j in the weights and b in the bias will produce a small change
out put, which is well approximated by
 ∂out put ∂out put
out put ≈ w j + b, (1.4)
j
∂w j ∂b

where the sum is overall weights, w j , and ∂out put


∂w j
and ∂out
∂b
put
denote the partial
derivates of the output with respect to w j and b, respectively. out put is a linear
function of w j and b. This linearity makes it easy to choose small changes in the
weights and biases to achieve the desired small change in the output, thus making it
considerably easier to determine how changing the weights and bias will change the
output.
Solving the “XOR” problem with an MLP. Linearly separable problems can be solved
with a single-layer perceptron. However, if the dataset is not linearly separable, a single
perceptron neuron will never converge to a solution. For example,
Fig. 1.2 shows the typical “XOR” function, which is a nonlinear function and cannot
be solved by a single-layer perceptron. In this case, we need to use an MLP, such as the
one shown on the right of Fig. 1.2, to solve this problem.

Fig. 1.2 Left: Illustration of the “XOR” problem. Right: A solution of the “XOR” problem with a
multilayer perceptron
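As a concrete check of this claim, the following short sketch hand-wires an MLP with threshold units that computes “XOR”; the specific weights (an OR unit, an AND unit, and an output unit that fires when OR holds but AND does not) are a standard textbook choice and are not necessarily the ones drawn in Fig. 1.2.

```python
import numpy as np

def step(z):
    """Hard-threshold activation used by perceptron-style units."""
    return (z > 0).astype(float)

def xor_mlp(x1, x2):
    x = np.array([x1, x2], dtype=float)
    h1 = step(x @ np.array([1.0, 1.0]) - 0.5)   # hidden unit 1: OR
    h2 = step(x @ np.array([1.0, 1.0]) - 1.5)   # hidden unit 2: AND
    # Output fires when OR is true but AND is false, i.e., XOR.
    return step(1.0 * h1 - 2.0 * h2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_mlp(a, b)))   # prints 0, 1, 1, 0
```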

1.1.3 Formulation of Neural Network

To conveniently describe the neural network, we use the following parameter settings.
Let n_l denote the total number of layers in the neural network, and let L_l denote the
lth layer. Thus, L_1 and L_{n_l} are the input layer and the output layer, respectively.
We use (W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, \ldots) to denote the parameters of the neural
network, where W_{ij}^{(l)} denotes the parameter of the connection between unit j in layer
l and unit i in layer l + 1. Additionally, b_i^{(l)} is the bias associated with unit i in layer
l + 1. Thus, in this case, W^{(1)} \in \mathbb{R}^{3\times3} and W^{(2)} \in \mathbb{R}^{1\times3}. We use a_i^{(l)} to denote the
activation of unit i in layer l. Given a fixed setting of the parameters (W, b), the
neural network defines a hypothesis h_{W,b}(x). Specifically, the computation that
this neural network represents is given by
 
a_1^{(2)} = f\big(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}\big),
a_2^{(2)} = f\big(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}\big),
a_3^{(2)} = f\big(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}\big),   (1.5)
h_{W,b}(x) = a_1^{(3)} = f\big(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}\big).
Let z_i^{(l)} denote the total weighted sum of inputs to unit i in layer l, including
the bias term (e.g., z_i^{(2)} = \sum_{j=1}^{n} W_{ij}^{(1)} x_j + b_i^{(1)}), such that a_i^{(l)} = f(z_i^{(l)}). If we
extend the activation function f(\cdot) to apply to vectors in an elementwise fashion
as f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)], then we can write the above equations
more compactly as

z^{(2)} = W^{(1)} x + b^{(1)},
a^{(2)} = f(z^{(2)}),
z^{(3)} = W^{(2)} a^{(2)} + b^{(2)},   (1.6)
h_{W,b}(x) = a^{(3)} = f(z^{(3)}).

We call this step forward propagation. More generally, recalling that we also use a^{(1)} = x
to denote the values from the input layer, then given layer l's activations a^{(l)},
we can compute layer (l + 1)'s activations a^{(l+1)} as

z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)},
a^{(l+1)} = f\big(z^{(l+1)}\big).   (1.7)
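A minimal sketch of this forward propagation in Python/NumPy is given below; it follows Eqs. (1.6)–(1.7) with a sigmoid activation, and the 3–3–1 layer sizes and random parameters are illustrative assumptions rather than part of the book's formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward propagation: a^(l+1) = f(W^(l) a^(l) + b^(l)), cf. Eq. (1.7)."""
    a = x                      # a^(1) = x
    for W, b in zip(weights, biases):
        z = W @ a + b          # z^(l+1) = W^(l) a^(l) + b^(l)
        a = sigmoid(z)         # a^(l+1) = f(z^(l+1))
    return a                   # h_{W,b}(x)

rng = np.random.default_rng(0)
# A 3-3-1 network as in Eq. (1.5): W^(1) in R^{3x3}, W^(2) in R^{1x3}.
weights = [rng.standard_normal((3, 3)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
x = np.array([0.5, -1.0, 2.0])
print(forward(x, weights, biases))
```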

1.2 New Techniques in Deep Learning

Compared with the traditional MLP, modern neural networks are generally deeper,
and optimizing them by backpropagation is more difficult. Thus, many new techniques
have been proposed to stabilize network training, such as batch normalization (BN)
and batch Kalman normalization [1].

1.2.1 Batch Normalization

BN is a technique for improving the performance and stability of neural networks.
This technique was introduced by Ioffe and Szegedy in 2015 [2]. Rather than just
normalizing the inputs to the network, BN normalizes the inputs to layers within the
network. The benefits of BN are as follows:
• Networks are trained faster: Although each training iteration will be slower because
of the extra normalization calculations during the forward pass and the additional
parameters to train during backpropagation, the network should converge much more
quickly; thus, training should be faster overall.
• Higher learning rates: Gradient descent generally requires small learning rates for
the network to converge. As networks become deeper, gradients become smaller
during backpropagation and thus require even more iterations. Using BN allows
much higher learning rates, thereby increasing the speed at which networks train.
8 1 The Foundation and Advances of Deep Learning

• Easier to initialize: Weight initialization can be difficult, particularly when creating
deeper networks. BN helps reduce the sensitivity to the initial starting weights.
Rather than whitening the features in layer inputs and outputs jointly, BN nor-
malizes each scalar feature independently by making it have a mean of zero and
variance of 1. For a layer with d-dimensional input x = (x^{(1)}, \ldots, x^{(d)}), BN
normalizes each dimension as

\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}},   (1.8)

where the expectation and variance are computed over the training dataset.
Then, for each activation x^{(k)}, a pair of parameters \gamma^{(k)}, \beta^{(k)} is introduced to
scale and shift the normalized value as

y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}.   (1.9)

Algorithm 1.1: Training and Inference with Batch Normalization

Input: Values of x over a minibatch: B = \{x_1, \ldots, x_m\}; parameters to be learned: \gamma, \beta
Output: y_i = BN_{\gamma,\beta}(x_i)

\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i                          // minibatch mean
\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2         // minibatch variance
\hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}    // normalize
y_i \leftarrow \gamma \hat{x}_i + \beta \equiv BN_{\gamma,\beta}(x_i)    // scale and shift

Consider a minibatch B of size m. Since the normalization is applied to each
activation independently, let us focus on a particular activation x^{(k)} and omit k for
clarity. We have m values of this activation in the minibatch:

B = \{x_1, \ldots, x_m\}.   (1.10)

Let the normalized values be \hat{x}_{1 \ldots m}, and let their linear transformations be y_{1 \ldots m}. We
refer to the transform

BN_{\gamma,\beta}: x_{1 \ldots m} \rightarrow y_{1 \ldots m}   (1.11)

as the BN transform. We present the BN transform in Algorithm 1.1. In this algorithm, \epsilon is a
constant added to the minibatch variance for numerical stability.
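The training-time BN transform of Algorithm 1.1 can be written in a few lines of NumPy, as in the sketch below; it covers only the forward pass over a minibatch (the moving averages used at inference and the backward pass are omitted), and the value of epsilon is an illustrative choice.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Algorithm 1.1 forward pass: normalize a minibatch x of shape (m, d) per dimension."""
    mu = x.mean(axis=0)                      # minibatch mean
    var = x.var(axis=0)                      # minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize, cf. Eq. (1.8)
    y = gamma * x_hat + beta                 # scale and shift, cf. Eq. (1.9)
    return y, mu, var

m, d = 32, 4
x = np.random.randn(m, d) * 3.0 + 5.0        # features with nonzero mean and large variance
gamma, beta = np.ones(d), np.zeros(d)
y, mu, var = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))         # approximately 0 and 1 per dimension
```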

1.2.2 Batch Kalman Normalization

Although the significance of BN has been demonstrated in many previous works,
its drawback cannot be neglected, i.e., its effectiveness diminishes when a small
minibatch is present in training. Consider a DNN that consists of a number of layers
minibatch is present in training. Consider a DNN that consists of a number of layers
from bottom to top. In the traditional BN, the normalization step seeks to eliminate the
change in the distributions of its internal layers by reducing their internal covariant
shifts. Prior to normalizing the distribution of a layer, BN first estimates its statistics,
including the means and variances. However, unlike the input data of the bottom
layer, whose statistics can be pre-estimated on the training set, the statistics of the
internal layers cannot be pre-estimated because their representations keep changing
after the network parameters have been updated in each training step. Hence, BN handles this issue with the
following schemes. (i) During the model training, it approximates the population
statistics by using the batch sample statistics in a minibatch. (ii) It retains the moving
average statistics in each training iteration, and it employs them during the inference.
However, BN has a limitation, namely, it is limited by the memory capacity of
computing platforms (e.g., GPUs), especially when the network size and image size
are large. In this case, the minibatch size is not sufficient to approximate the statistics,
causing them to have bias and noise. Additionally, the errors would be amplified when
the network becomes deeper, degenerating the quality of the trained model. Negative
effects also exist in the inference, where the normalization is applied for each testing
sample. Furthermore, in the BN mechanism, the distribution of a certain layer could
vary along with the training iteration, which limits the stability of the convergence
of the model.
Recently, an extension of BN, called batch renormalization (BRN) [2], has been
proposed to improve the performance of BN when the minibatch size is small. Batch
Kalman normalization (BKN) [1] advances the existing solutions by achieving a more
accurate estimation of the statistics (means and variances) of the internal representations in DNNs. In contrast to
BN and BRN, where the statistics are estimated by only measuring the minibatches
within a certain layer, i.e., they considered each layer in the network as an isolated
subsystem, BKN shows that the estimated statistics have strong correlations among
the sequential layers. Moreover, the estimations can be more accurate by jointly con-
sidering its preceding layers in the network, as illustrated in Fig. 1.3b. By analogy,
the proposed estimation method shares merits with the Kalman filtering process [3].
BKN performs two steps in an iterative manner. In the first step, BKN estimates
the statistics of the current layer conditioned on the estimations of the previous
layer. In the second step, these estimations are combined with the observed batch
sample means and variances calculated within a minibatch. These two steps are
efficient in BKN. Updating the current estimation by previous states brings negligible
extra computational cost compared to the traditional BN. For example, in recent
advanced deep architectures such as residual networks, the feature representations
have a maximum number of 2048 dimensions (channels), and the extra cost is the
matrix-vector product by transforming a state vector (representing the means and
variances) with a maximum number of 2048 dimensions into a new state vector and
then combining it with the current observations (Fig. 1.4).

Fig. 1.3 (a) illustrates the distribution estimation in the conventional batch normalization (BN),
where the minibatch statistics, \mu_k and \Sigma_k, are estimated based on the currently observed minibatch
at the kth layer. For clarity of notation, \mu_k and \Sigma_k indicate the mean and the covariance matrix,
respectively. Note that only the diagonal entries are used in normalization. X and \hat{X} represent the
internal representation before and after normalization. In (b), batch Kalman normalization (BKN)
provides a more accurate distribution estimation of the kth layer by aggregating the statistics of the
preceding (k-1)th layer

Fig. 1.4 Illustration of the proposed batch Kalman normalization (BKN). At the (k-1)th layer
of a DNN, BKN first estimates its statistics (means and covariances), \hat{\mu}^{k-1|k-1} and \hat{\Sigma}^{k-1|k-1}.
The estimations in the kth layer are then based on the estimations of the (k-1)th layer,
where these estimations are updated by combining with the observed statistics of the kth layer. This
process treats the entire DNN as a whole system, in contrast to existing works that estimated the
statistics of each hidden layer independently

1.2.2.1 Batch Kalman Normalization Method

Let x^k be the feature vector of a hidden neuron in the kth hidden layer of a DNN, such
as a pixel in a hidden convolutional layer of a CNN. BN normalizes the values of x^k
by using a minibatch of m samples, B = \{x_1^k, x_2^k, \ldots, x_m^k\}. The mean and covariance
of x^k are approximated by

S^k \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i^k - \bar{x}^k)(x_i^k - \bar{x}^k)^T   (1.12)
and

\bar{x}^k \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i^k.   (1.13)

We have \hat{x}^k \leftarrow (x_i^k - \bar{x}^k)/\sqrt{diag(S^k)}, where diag(\cdot) denotes the diagonal entries of a matrix,
i.e., the variances of x^k. Then, the normalized representation is scaled and shifted
to preserve the modeling capacity of the network, y^k \leftarrow \gamma \hat{x}^k + \beta, where \gamma and \beta
are parameters that are optimized during training. However, a minibatch with a
moderately large size is required to estimate the statistics in BN. It is compelling to
explore better estimations of the distribution in a DNN to accelerate training. Assume
that the true values of the hidden neurons in the kth layer can be represented by the
variable x^k, which is approximated by using the values in the previous layer x^{k-1}.
We have

x^k = A^k x^{k-1} + u^k,   (1.14)

where A^k is a state transition matrix that transforms the states (features) in the
previous layer to the current layer. Additionally, u^k is a bias that follows a Gaussian
distribution with zero mean and unit variance. Note that A^k could be a linear transition
between layers. This is reasonable because our purpose is not to accurately compute
the hidden features in a certain layer given those in the previous layer but rather to
draw a connection between layers to estimate the statistics.
As the above true values of x^k exist but are not directly accessible, they can be
measured by the observation z^k with a bias term v^k:

z^k = x^k + v^k,   (1.15)

where z^k indicates the observed values of the features in a minibatch. In other words,
to estimate the statistics of x^k, previous studies only consider the observed value
of z^k in a minibatch. BKN takes the features in the previous layer into account.
To this end, we compute the expectation on both sides of Eq. (1.14), i.e., E[x^k] =
E[A^k x^{k-1} + u^k], and have

\hat{\mu}^{k|k-1} = A^k \hat{\mu}^{k-1|k-1},   (1.16)

where \hat{\mu}^{k-1|k-1} denotes the estimation of the mean in the (k-1)th layer, and \hat{\mu}^{k|k-1}
is the estimation of the mean in the kth layer conditioned on the previous layer. We
call \hat{\mu}^{k|k-1} an intermediate estimation of layer k because it is then combined
with the observed values to achieve the final estimation. As shown in Eq. (1.17),
the estimation in the current layer \hat{\mu}^{k|k} is computed by combining the intermediate
estimation with a bias term, which represents the error between the observed values
z^k and \hat{\mu}^{k|k-1}. Here, z^k indicates the observed mean values, and we have z^k = \bar{x}^k.
Additionally, q^k is a gain value indicating how much we rely on this bias:

\hat{\mu}^{k|k} = \hat{\mu}^{k|k-1} + q^k (z^k - \hat{\mu}^{k|k-1}).   (1.17)



Algorithm 1.2: Training and Inference with Batch Kalman Normalization

Input: Values of feature maps \{x_{1 \ldots m}\} in the kth layer; \hat{\mu}^{k-1|k-1} and \hat{\Sigma}^{k-1|k-1} in the
(k-1)th layer; parameters \gamma and \beta; moving mean \mu and moving variance \Sigma; moving
momentum \alpha; Kalman gain q^k and transition matrix A^k.
Output: \{y_i^k = BKN(x_i^k)\}; updated \mu, \Sigma; statistics \hat{\mu}^{k|k} and \hat{\Sigma}^{k|k} in the current layer.
Train:
\bar{x}^k \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i^k,   S^k \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i^k - \bar{x}^k)(x_i^k - \bar{x}^k)^T
p^k \leftarrow 1 - q^k,   \hat{\mu}^{k|k-1} \leftarrow A^k \hat{\mu}^{k-1|k-1},   \hat{\mu}^{k|k} \leftarrow p^k \hat{\mu}^{k|k-1} + q^k \bar{x}^k
\hat{\Sigma}^{k|k-1} \leftarrow A^k \hat{\Sigma}^{k-1|k-1} (A^k)^T + R
\hat{\Sigma}^{k|k} \leftarrow p^k \hat{\Sigma}^{k|k-1} + q^k S^k + p^k q^k (\bar{x}^k - \hat{\mu}^{k|k-1})(\bar{x}^k - \hat{\mu}^{k|k-1})^T
y_i^k \leftarrow \frac{x_i^k - \hat{\mu}^{k|k}}{\sqrt{diag(\hat{\Sigma}^{k|k})}} \gamma + \beta
Moving average: \mu := \mu + \alpha(\mu - \hat{\mu}^{k|k}),   \Sigma := \Sigma + \alpha(\Sigma - \hat{\Sigma}^{k|k})
Inference: y_{inference} \leftarrow \frac{x - \mu}{\sqrt{diag(\Sigma)}} \gamma + \beta

Similarly, the estimations of the covariances can be achieved by calculating
\hat{\Sigma}^{k|k-1} = Cov(x^k - \hat{\mu}^{k|k-1}) and \hat{\Sigma}^{k|k} = Cov(x^k - \hat{\mu}^{k|k}), where Cov(\cdot) represents
the definition of the covariance matrix. By introducing p^k = 1 - q^k and z^k = \bar{x}^k
and combining the above definitions with Eqs. (1.16) and (1.17), we have the follow-
ing update rules to estimate the statistics, as shown in Eq. (1.18). Its proof is given
in the Appendix.

\begin{cases}
\hat{\mu}^{k|k-1} = A^k \hat{\mu}^{k-1|k-1}, \\
\hat{\mu}^{k|k} = p^k \hat{\mu}^{k|k-1} + q^k \bar{x}^k, \\
\hat{\Sigma}^{k|k-1} = A^k \hat{\Sigma}^{k-1|k-1} (A^k)^T + R, \\
\hat{\Sigma}^{k|k} = p^k \hat{\Sigma}^{k|k-1} + q^k S^k + p^k q^k (\bar{x}^k - \hat{\mu}^{k|k-1})(\bar{x}^k - \hat{\mu}^{k|k-1})^T,
\end{cases}   (1.18)

where \hat{\Sigma}^{k|k-1} and \hat{\Sigma}^{k|k} denote the intermediate and the final estimations of the covari-
ance matrices in the kth layer, respectively. R is the covariance matrix of the bias u^k
in Eq. (1.14). Note that it is identical for all the layers. S^k is the observed covariance
matrix of the minibatch in the kth layer. In Eq. (1.18), the transition matrix A^k, the
covariance matrix R, and the gain value q^k are parameters that are optimized during
training. In BKN, we employ \hat{\mu}^{k|k} and \hat{\Sigma}^{k|k} to normalize the hidden representation.
Please refer to [1] for further details of batch Kalman normalization.
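To make the update rules of Eq. (1.18) concrete, the sketch below estimates the statistics of layer k from the layer k-1 estimates and the current minibatch. It is deliberately simplified relative to [1]: it keeps only diagonal covariances and treats the transition A^k, the noise R, and the gain q^k as fixed scalars, whereas BKN learns them during training.

```python
import numpy as np

def bkn_statistics(x_k, mu_prev, var_prev, A=1.0, R=0.1, q=0.5):
    """Estimate layer-k mean/variance from layer-(k-1) estimates and a minibatch.

    x_k: minibatch features of shape (m, d); mu_prev, var_prev: previous-layer
    estimates mu^{k-1|k-1} and diag(Sigma^{k-1|k-1}). A, R, q are simplified to
    scalars here; in BKN they are learned parameters.
    """
    x_bar = x_k.mean(axis=0)      # observed minibatch mean
    S = x_k.var(axis=0)           # observed minibatch variance (diagonal of S^k)
    p = 1.0 - q
    mu_pred = A * mu_prev                                   # mu^{k|k-1}
    mu_est = p * mu_pred + q * x_bar                        # mu^{k|k}
    var_pred = A * var_prev * A + R                         # diag of Sigma^{k|k-1}
    var_est = p * var_pred + q * S + p * q * (x_bar - mu_pred) ** 2   # diag of Sigma^{k|k}
    return mu_est, var_est

def bkn_normalize(x_k, mu_est, var_est, gamma, beta, eps=1e-5):
    """Normalize the layer with the BKN estimates instead of the raw batch statistics."""
    return gamma * (x_k - mu_est) / np.sqrt(var_est + eps) + beta

x = np.random.randn(8, 4)         # a micro-batch of only 8 samples
mu, var = bkn_statistics(x, mu_prev=np.zeros(4), var_prev=np.ones(4))
y = bkn_normalize(x, mu, var, gamma=np.ones(4), beta=np.zeros(4))
```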

References

1. G. Wang, J. Peng, P. Luo, X. Wang, L. Lin, Batch Kalman normalization: towards training deep
neural networks with micro-batches, arXiv preprint arXiv:1802.03133 (2018)
2. S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal
covariate shift, arXiv preprint arXiv:1502.03167 (2015)
3. R.E. Kalman, A new approach to linear filtering and prediction problems. J. Basic Eng. 82(1),
35–45 (1960)
Chapter 2
Human-Centric Visual Analysis: Tasks
and Progress

Abstract Research on human-centric visual analysis has achieved considerable
progress in recent years. In this chapter, we briefly review the tasks of human-centric
visual analysis, including face detection, facial landmark localization, pedestrian
detection, human segmentation, clothes parsing, etc.

2.1 Face Detection

As one key step toward many subsequent face-related applications, face detection
has been extensively studied in the computer vision literature. Early efforts in face
detection date back to as early as the beginning of the 1970s, where simple heuristic
and anthropometric techniques [1] were used. Prior to 2000, despite progress [2, 3],
the practical performance of face detection was far from satisfactory. One genuine
breakthrough was the Viola-Jones framework [4], which applied rectangular Haar-
like features in a cascaded AdaBoost classifier to achieve real-time face detection.
However, this framework has several critical drawbacks. First, its feature size was
relatively large. Typically, in a 24 × 24 detection window, the number of Haar-
like features was 160 thousand [5]. Second, this framework is not able to effectively
handle non-frontal faces in the wild. Many works have been proposed to address these
issues of the Viola-Jones framework and achieve further improvements. First, more
complicated features (such as HOG [6], SIFT [7], SURF [8]) were used. For example,
Liao et al. [9] proposed a new image feature called normalized pixel difference
(NPD), which is computed as the difference to sum ratio between two pixel values.
Second, to detect faces with various poses, some works combined multiple detectors,
each of which was trained for a specific view. As a representative work, Zhu et al.
[10] applied multiple deformable part models to capture faces with different views
and expressions.
Recent years have witnessed advances in face detection using deep learning meth-
ods, which significantly outperform traditional computer vision methods. For exam-
ple, Li et al. [11] proposed a cascade architecture built on CNNs, which can quickly
reject the background regions in the fast low-resolution stage and effectively calibrate
the bounding boxes of face proposals in the high-resolution stages. Following a similar
procedure, Zhang et al. [12] leveraged a cascaded multitask architecture to enhance
the face detection performance by exploiting the inherent correlation between detec-
tion and alignment. However, these single-scale detectors had to perform multiscale
testing on image pyramids, which is time consuming. To reduce the level number
of image pyramids, Hao et al. [13] and Liu et al. [14] proposed an efficient CNN
that predicts the scale distribution histogram of the faces and guides the zoom-in and
zoom-out of the images or features. Recently, many works have adapted the generic
object detector Faster R-CNN [15] to perform face detection. For example, Wan
et al. [16] bootstrapped Faster R-CNN with hard negative mining and achieved a sig-
nificant improvement on the representative face detection benchmark FDDB [17].
Despite achieving progress, these methods generally failed to detect tiny faces in
unconstrained conditions. To address this issue, Bai et al. [18] first generated a clear
high-resolution face from a blurry small one by adopting a generative adversarial
network and then performed the face detection.

2.2 Facial Landmark Localization

2.2.1 Conventional Approaches

Facial landmark localization has long been attempted in computer vision, and a
large number of approaches have been proposed for this purpose. The conventional
approaches for this task can be divided into two categories: template fitting methods
and regression-based methods.
Template fitting methods build face templates to fit the input face appearance. A
representative work is the active appearance model (AAM) [19], which attempts
to estimate model parameters by minimizing the residuals between the holistic
appearance and an appearance model. Rather than using holistic representations,
a constrained local model (CLM) [20] learns an independent local detector for each
facial keypoint and a shape model for capturing valid facial deformations. Improved
versions of CLM primarily differ from each other in terms of local detectors. For
instance, Belhumeur et al. [21] detected facial landmarks by employing SIFT fea-
tures and SVM classifiers, and Liang et al. [22] applied AdaBoost to the HAAR
wavelet features. These methods are generally superior to the holistic methods due
to the robustness of patch detectors against illumination variations and occlusions.
Regression-based facial landmark localization methods can be further divided
into direct mapping techniques and cascaded regression models. The former directly
maps local or global facial appearances to landmark locations. For example, Dantone
et al. [23] estimated the absolute coordinates of facial landmarks directly from an
ensemble of conditional regression trees trained on facial appearances. Valstar et al.
[24] applied boosted regression to map the appearances of local image patches to the
positions of corresponding facial landmarks. Cascaded regression models [25–31]
formulate shape estimation as a regression problem and make predictions in a cas-
caded manner. These models typically start from an initial face shape and iteratively
refine the shape according to learned regressors, which map local appearance fea-
tures to incremental shape adjustments until convergence is achieved. Cao et al. [25]
trained a cascaded nonlinear regression model to infer an entire facial shape from
an input image using pairwise pixel-difference features. Burgos–Artizzu et al. [32]
proposed a novel cascaded regression model for estimating both landmark positions
and their occlusions using robust shape-indexed features. Another seminal method is
the supervised descent method (SDM) [27], which uses SIFT features extracted from
around the current shape and minimizes a nonlinear least-squares objective using the
learned descent directions. All these methods assume that an initial shape is given in
some form, e.g., a mean shape [27, 28]. However, this assumption is too strict and
may lead to poor performance on faces with large pose variations.

2.2.2 Deep-Learning-Based Models

Despite their acknowledged successes, all the aforementioned conventional
approaches rely on complicated feature engineering and parameter tuning, which
consequently limits their performance in cluttered and diverse settings. Recently,
CNNs and other deep learning models have been successfully applied to various
visual computing tasks, including facial landmark estimation. Zhou et al. [33] pro-
posed a four-level cascaded regression model based on CNNs, which sequentially
predicted landmark coordinates. Zhang et al. [34] employed a deep architecture to
jointly optimize facial landmark positions with other related tasks, such as pose esti-
mation [35] and facial expression recognition [36]. Zhang et al. [37] proposed a
new coarse-to-fine DAE pipeline to progressively refine facial landmark locations.
In 2016, they further presented de-corrupt autoencoders to automatically recover the
genuine appearance of the occluded facial parts, followed by predicting the occlusive
facial landmarks [38]. Lai et al. [39] proposed an end-to-end CNN architecture to
learn highly discriminative shape-indexed features and then refined the shape using
the learned deep features via sequential regressions. Merget et al. [40] integrated
the global context in a fully convolutional network based on dilated convolutions
for generating robust features for landmark localization. Bulat et al. [41] utilized a
facial super-resolution technique to locate the facial landmarks from low-resolution
images. Tang et al. [42] proposed quantized densely connected U-Nets to largely
improve the information flow, which helps to enhance the accuracy of landmark
localization. RNN-based models [43–45] formulate facial landmark detection as a
sequential refinement process in an end-to-end manner. Recently, 3D face models
[46–50] have also been utilized to accurately locate the landmarks by modeling the
structure of facial landmarks. Moreover, many researchers have attempted to adapt
some unsupervised [51–53] or semisupervised [54] approaches to improve the pre-
cision of facial landmark detectors.

2.3 Pedestrian Detection

Pedestrian detection is a subtask of general object detection where pedestrians, rather
than all involved objects, are detected in a given image. Since this task is significant
for security monitoring, safe self-driving, and other application scenarios, it has been
extensively studied over the past years.
Due to the diversity of pedestrian gestures, the variety of backgrounds, and other
reasons, pedestrian detection could be very challenging. In the following, we list
several factors that could affect pedestrian detection.
Diversity of appearance. For instance, rather than standing as still figures, pedes-
trians could appear with different clothing, gestures, angle of view, and illumination.
Scale variation. Because of the distance to the camera, pedestrians would appear
at different scales in the image. Large-scale pedestrians are relatively easy to detect,
while pedestrians at small scales are challenging.
Occlusion. In practical scenarios, pedestrians could be occluded by each other or
by buildings, parked cars, trees, or other types of objects on the street.
Backgrounds. Algorithms are confronted with hard negative samples, which are
objects that appear like pedestrians and could easily be misclassified.
Time and space complexity. Due to the large amount of candidate bounding
boxes, the methods could be space consuming. Additionally, cascaded approaches
are used in some methods, which could be time consuming. However, practical usage
scenarios need real-time detection and memory saving.

2.3.1 Benchmarks for Pedestrian Detection

INRIA [55] was released in 2005, containing 1805 images of humans cropped from
a varied set of personal photos. ETH [56] was collected during strolls through busy
shopping streets. Daimler [57] contains pedestrians that are fully visible in an upright
position. TUD [58] was developed for many tasks, including pedestrian detection.
Positive samples of the training set were collected in a busy pedestrian zone with
a handheld camera, including not only upright standing pedestrians but also side
standing ones. Negative samples of the training set were collected in an inner city
district and also from vehicle driving videos. The test set is collected in the inner
city of Brussels from a driving car. All pedestrians are annotated. KITTI [59] was
collected by four high-resolution video cameras, and up to 15 cars and 30 pedestrians
are visible per image. Caltech [60] is the largest pedestrian dataset to date, comprising
10 h of vehicle driving video captured in an urban scenario. This dataset includes pedestrians
in different scales and positions, and various degrees of occlusions are also included.

2.3.2 Pedestrian Detection Methods

The existing methods can be divided into two categories: one is handcrafted features
followed by a classical classifier, and the other is deep learning methods.

2.3.2.1 Two-Stage Architectures of Pedestrian Detection

Early approaches typically consist of two separate stages: feature extraction and
binary classification. Candidate bounding boxes are generated by sliding-window
methods. Classic HOG [55] proposed using histogram of oriented gradients as fea-
tures and a linear support vector machine as the classifier. Following this framework,
various feature descriptors and classifiers were proposed. Typical classifiers include
nonlinear SVM and AdaBoost. HIKSVM [61] proposed using histogram intersec-
tion kernel SVM, which is a nonlinear SVM. RandForest [62] used a random forest
ensemble, rather than SVM, as the classifier. Among the various feature descriptors, ICF
[63] generalized several basic features into multiple channel features computed via
linear filters, nonlinear transformations, pointwise transformations, integral his-
tograms, and gradient histograms. Integral images are used to obtain the final features.
Features are learned with a boosting algorithm, and decision trees are employed as
the weak classifier. SCF [64] inherited the main idea of ICF, but it proposed a revision
with insights. Rather than using regular cells as the classic HOG method does, SCF
attempts to learn an irregular pattern of cells. The feature pool consists of squares
in detection windows. ACF [65] attempted to accelerate pyramid feature learning
through the aggregation of channel features. Additionally, it learns via AdaBoost [66]
with decision trees as the base classifiers. LDCF [67] proposed a local decorrelation trans-
formation. SpatialPooling [68] was built based on ACF [65]. Spatial pooling is used
to compute the covariance descriptor and local binary pattern descriptor, enhancing
the robustness to noise and transformation. Features are learned by structural SVM.
[69] explored several types of filters, and a checkerboard filter achieved the best per-
formance. Deformable part models (DPMs) have been widely used for solving the
occlusion issue. [70] first proposed deformable parts filters, which are placed near
the bottom level of the HOG feature pyramid. A multiresolution model was proposed
by [71] as a DPM. [72] used DPM for multi-pedestrian detection and proved that
DPM can be flexibly incorporated with other descriptors such as HOG. [73] designed
a multitask form of DPM that captures the similarities and differences of samples.
DBN-Isol [74] proposed a discriminative deep model for learning the correlations of
deformable parts. In [75], a parts model was embedded into a designed deep model.
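To make the classic two-stage recipe at the start of this subsection concrete, the sketch below pairs HOG features with a linear SVM and a sliding window, in the spirit of [55]. It is a minimal illustration rather than any of the cited systems; the 128 × 64 window, the HOG cell and block settings, the SVM regularization constant, and the decision threshold are all assumed.

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

WIN_H, WIN_W = 128, 64  # assumed canonical pedestrian window

def hog_descriptor(window):
    # 9 orientation bins, 8x8-pixel cells, 2x2-cell blocks (classic HOG settings)
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def train_detector(train_crops, labels):
    # train_crops: grayscale 128x64 windows; labels: 1 = pedestrian, 0 = background
    feats = np.stack([hog_descriptor(c) for c in train_crops])
    clf = LinearSVC(C=0.01)
    clf.fit(feats, labels)
    return clf

def sliding_window_detect(image, clf, stride=8, thresh=0.0):
    # Score every window of one pyramid level; in practice this is repeated
    # over an image pyramid and followed by non-maximum suppression.
    detections = []
    for y in range(0, image.shape[0] - WIN_H + 1, stride):
        for x in range(0, image.shape[1] - WIN_W + 1, stride):
            f = hog_descriptor(image[y:y + WIN_H, x:x + WIN_W])
            score = clf.decision_function(f[None, :])[0]
            if score > thresh:
                detections.append((x, y, score))
    return detections

Boosted variants such as ICF and ACF replace the HOG-plus-SVM pair with channel features and AdaBoost over decision trees, but the window-scoring structure remains the same.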

2.3.2.2 Deep Convolutional Architectures of Pedestrian Detection

Sermanet et al. [76] first used a deep convolutional architecture. Reference [76]
designed a multiscale convolutional network composed of two stages of convolu-
tional layers for feature extraction, which is followed by a classifier. The model is
first trained with unsupervised learning layer by layer and then using supervised
learning with a classifier for label prediction. Unlike previous approaches, this con-
volutional network performs end-to-end training, whose features are all learned from
the input data. Moreover, bootstrapping is used for relieving the imbalance between
positive and negative samples. JointDeep [77] designed a deep convolutional net-
work. Each of the convolutional layers in the proposed deep network is responsible
for a specific task, while the whole network jointly learns feature extraction, defor-
mation handling, occlusion handling, and classification. MultiSDP [78] proposed
a multistage contextual deep model simulating the cascaded classifiers. Rather than
training sequentially, the cascaded classifiers in the deep model can be trained jointly
using backpropagation. SDN [79] proposed a switchable restricted Boltzmann machine for
better detection against cluttered backgrounds and pedestrians with varied appearances.
Driven by the success of (“slow”) R-CNN [80] for general object detection, a
recent series of methods have adopted a two-stage pipeline for pedestrian detection.
These methods first use proposal methods to predict candidate detection bounding
boxes, generally a large number of them. These candidate boxes are then fed into a CNN for
feature learning and class prediction. In the task of pedestrian detection, the proposal
methods used are generally standalone pedestrian detectors consisting of handcrafted
features and boosted classifiers.
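A minimal sketch of this proposal-plus-CNN pipeline is given below. The proposal detector and the binary-classification CNN are placeholders for the standalone pedestrian proposers and networks discussed next; the crop size, the use of a sigmoid score, and PyTorch itself are assumptions made for illustration.

import torch
import torchvision.transforms.functional as TF

def rescore_proposals(pil_image, boxes, cnn, crop_hw=(128, 64), device='cpu'):
    # Re-score candidate pedestrian boxes with a CNN.
    # pil_image: a PIL image; boxes: (x1, y1, x2, y2) tuples from a standalone
    # proposal detector (e.g., a boosted channel-features detector);
    # cnn: any network mapping a resized crop to one pedestrian-vs-background logit.
    cnn.eval()
    scores = []
    with torch.no_grad():
        for (x1, y1, x2, y2) in boxes:
            crop = pil_image.crop((x1, y1, x2, y2)).resize((crop_hw[1], crop_hw[0]))
            x = TF.to_tensor(crop).unsqueeze(0).to(device)
            scores.append(torch.sigmoid(cnn(x)).item())
    return scores  # keep boxes whose score exceeds a chosen threshold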
Reference [81] used SquaresChnFtrs [64] as the proposal method, with the resulting
proposals fed into a CNN for classification. In that work, two CNNs of different scales were
tried, namely, CifarNet [82] and AlexNet [83]. The methods were evaluated on the
Caltech [60] and KITTI [59] datasets. The performance was on par with the state of
the art at that time but was not able to surpass some of the handcrafted methods
due to the CNN design and the lack of parts or occlusion modeling.
TA-CNN [84] employed the ACF detector [65], combined with semantic infor-
mation, to generate proposals. The CNN used was revised from AlexNet [83]. This
method attempted to improve detection by reducing the confusion between
positive samples and hard negative ones. The method was evaluated on the Caltech
[60] and ETH [56] datasets, and it surpassed state-of-the-art methods.
DeepParts [85] applied the LDCF [67] detector to generate proposals and learned
a set of complementary parts by neural networks, improving occlusion detection.
They first constructed a part pool covering all positions and ratios of body parts,
and they automatically chose appropriate parts for part detection. Subsequently, the
model learned a part detector for each body part without using part annotations. These
part detectors are independent CNN classifiers, one for each body part. Furthermore,
proposal shifting problems were handled. Finally, full-body scores were inferred,
and pedestrian detection was fulfilled.
SAF R-CNN [86] implemented an intuitive revision of this R-CNN two-stage
approach. They used the ACF detector [65] for proposal generation. The proposals
were fed into a CNN, and they were soon separated into two branches of subnetwork,
driven by a scale-aware weighting layer. Each of the subnetworks is a popular Fast
R-CNN [15] framework. This approach improved small-size pedestrian detection.

Unlike the above R-CNN-based methods, the CompACT method [87] extracted
both handcrafted features and deep convolutional features, on top of which it
learned boosted classifiers. A complexity-aware cascade boosting algorithm was
used such that features of various complexities are able to be integrated into one
single model.
The CCF detector [88] is a boosted classifier built on pyramids of deep convolutional fea-
tures and uses no region proposals. Rather than using a deep convolutional network
as both feature learner and predictor, as the above methods do, this method utilized the
deep convolutional network only as a first-stage image feature extractor.

2.4 Human Segmentation and Clothes Parsing

The goal of human parsing is to partition the human body into different semantic
parts, such as hair, head, torso, arms, legs, and so forth, which provides rich descrip-
tions for human-centric analysis and thus becomes increasingly important for many
computer vision applications, including content-based image/video retrieval, person
re-identification, video surveillance, action recognition, and clothes fashion recogni-
tion. However, human parsing is very challenging in real-life scenarios due to the variability in
human appearance and shape caused by the wide variety of human poses, clothing
types, and occlusion/self-occlusion patterns.
Part segment proposal generation. Previous works generally adopt low-level
segment-based proposals, while some approaches exploit higher level cues. Bo
and Fowlkes exploited roughly learned part location priors and part mean shape infor-
mation, and they derived a number of part segments from the gPb-UCM method using
a constrained region merging method. Dong et al. employed Parselets as proposals
to obtain mid-level semantic part information. However, low-level, mid-level, or
rough location proposals may all result in many false positives, misleading the later
processing stages.

References

1. T. Sakai, M. Nagao, T. Kanade, Computer analysis and classification of photographs of
human faces (Kyoto University, 1972)
2. K.-K. Sung, T. Poggio, Example-based learning for view-based human face detection. TPAMI
20(1), 39–51 (1998)
3. H. Rowley, S. Baluja, T. Kanade, Rotation invariant neural network-based face detection, in
CVPR. sn, p. 38 (1998)
4. P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in CVPR,
vol. 1. IEEE, pp. I–511 (2001)
5. P. Viola, M.J. Jones, Robust real-time face detection. IJCV 57(2), 137–154 (2004)
6. Q. Zhu, M.-C. Yeh, K.-T. Cheng, S. Avidan, Fast human detection using a cascade of histograms
of oriented gradients, in Computer Vision and Pattern Recognition, 2006 IEEE Computer
Society Conference on, vol. 2. IEEE, pp. 1491–1498 (2006)
7. P.C. Ng, S. Henikoff, Sift: Predicting amino acid changes that affect protein function. Nucleic
acids research 31(13), 3812–3814 (2003)
8. Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, J. R. Smith, Learning locally-adaptive decision
functions for person verification, in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 3610–3617 (2013)
9. S. Liao, A.K. Jain, S.Z. Li, A fast and accurate unconstrained face detector. IEEE transactions
on pattern analysis and machine intelligence 38(2), 211–223 (2016)
10. X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild,
in CVPR. IEEE, pp. 2879–2886 (2012)
11. H. Li, Z. Lin, X. Shen, J. Brandt, G. Hua, A convolutional neural network cascade for face
detection, in CVPR, pp. 5325–5334 (2015)
12. K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded
convolutional networks. IEEE Signal Processing Letters 23(10), 1499–1503 (2016)
13. Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, X. Hu, Scale-aware face detection, in CVPR, vol. 3 (2017)
14. Y. Liu, H. Li, J. Yan, F. Wei, X. Wang, X. Tang, Recurrent scale approximation for object
detection in cnn, in ICCV, vol. 5 (2017)
15. R. Girshick, Fast r-cnn, in Proceedings of the IEEE International Conference on Computer
Vision, pp. 1440–1448 (2015)
16. S. Wan, Z. Chen, T. Zhang, B. Zhang, K.-k. Wong, Bootstrapping face detection with hard
negative examples, arXiv preprint arXiv:1608.02236 (2016)
17. V. Jain, E. Learned-Miller, Fddb: a benchmark for face detection in unconstrained settings,
Technical Report UM-CS-2010-009, University of Massachusetts, Amherst (2010)
18. Y. Bai, Y. Zhang, M. Ding, B. Ghanem, Finding tiny faces in the wild with generative adversarial
network, in CVPR (2018)
19. T.F. Cootes, G.J. Edwards, C.J. Taylor, Active appearance models. PAMI 6, 681–685 (2001)
20. J.M. Saragih, S. Lucey, J.F. Cohn, Deformable model fitting by regularized landmark mean-
shift. IJCV 91(2), 200–215 (2011)
21. P.N. Belhumeur, D.W. Jacobs, D.J. Kriegman, N. Kumar, Localizing parts of faces using a
consensus of exemplars. PAMI 35(12), 2930–2940 (2013)
22. L. Liang, R. Xiao, F. Wen, J. Sun, Face alignment via component-based discriminative search,
in ECCV (Springer, 2008), pp. 72–85
23. M. Dantone, J. Gall, G. Fanelli, L. Van Gool, Real-time facial feature detection using conditional
regression forests, in CVPR (IEEE, 2012), pp. 2578–2585
24. M. Valstar, B. Martinez, X. Binefa, M. Pantic, Facial point detection using boosted regression
and graph models, in CVPR (IEEE, 2010), pp. 2729–2736
25. X. Cao, Y. Wei, F. Wen, J. Sun, Face alignment by explicit shape regression. IJCV 107(2),
177–190 (2014)
26. V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees,
in CVPR, pp. 1867–1874 (2014)
27. X. Xiong, F. Torre, Supervised descent method and its applications to face alignment, in CVPR,
pp. 532–539 (2013)
28. S. Ren, X. Cao, Y. Wei, J. Sun, Face alignment at 3000 fps via regressing local binary features,
in CVPR, pp. 1685–1692 (2014)
29. S. Zhu, C. Li, C.-C. Loy, X. Tang, Unconstrained face alignment via cascaded compositional
learning, in CVPR, pp. 3409–3417 (2016)
30. O. Tuzel, T. K. Marks, S. Tambe, Robust face alignment using a mixture of invariant experts,
in ECCV (Springer, 2016), pp. 825–841
31. X. Fan, R. Liu, Z. Luo, Y. Li, Y. Feng, Explicit shape regression with characteristic number for
facial landmark localization, TMM (2017)
32. X. Burgos-Artizzu, P. Perona, P. Dollár, Robust face landmark estimation under occlusion, in
ICCV, pp. 1513–1520 (2013)
33. E. Zhou, H. Fan, Z. Cao, Y. Jiang, Q. Yin, Extensive facial landmark localization with coarse-
to-fine convolutional network cascade, in ICCV Workshops, pp. 386–391 (2013)
34. Z. Zhang, P. Luo, C.C. Loy, X. Tang, Facial landmark detection by deep multi-task learning,
in ECCV (Springer, 2014), pp. 94–108
35. H. Liu, D. Kong, S. Wang, B. Yin, Sparse pose regression via componentwise clustering feature
point representation. TMM 18(7), 1233–1244 (2016)
36. T. Zhang, W. Zheng, Z. Cui, Y. Zong, J. Yan, K. Yan, A deep neural network-driven feature
learning method for multi-view facial expression recognition. TMM 18(12), 2528–2536 (2016)
37. J. Zhang, S. Shan, M. Kan, X. Chen, Coarse-to-fine auto-encoder networks (cfan) for real-time
face alignment, in ECCV (Springer, 2014), pp. 1–16
38. J. Zhang, M. Kan, S. Shan, X. Chen, Occlusion-free face alignment: deep regression networks
coupled with de-corrupt autoencoders, in CVPR, pp. 3428–3437 (2016)
39. H. Lai, S. Xiao, Z. Cui, Y. Pan, C. Xu, S. Yan, Deep cascaded regression for face alignment,
arXiv preprint arXiv:1510.09083 (2015)
40. D. Merget, M. Rock, G. Rigoll, Robust facial landmark detection via a fully-convolutional
local-global context network, in CVPR, pp. 781–790 (2018)
41. A. Bulat and G. Tzimiropoulos, Super-fan: Integrated facial landmark localization and super-
resolution of real-world low resolution faces in arbitrary poses with gans, in CVPR (2018)
42. Z. Tang, X. Peng, S. Geng, L. Wu, S. Zhang, D. Metaxas, Quantized densely connected u-nets
for efficient landmark localization, in ECCV (2018)
43. X. Peng, R.S. Feris, X. Wang, D.N. Metaxas, A recurrent encoder-decoder network for sequen-
tial face alignment, in ECCV (Springer, 2016), pp. 38–56
44. S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, A. Kassim, Robust facial landmark detection via
recurrent attentive-refinement networks, in ECCV (Springer, 2016), pp. 57–72
45. G. Trigeorgis, P. Snape, M.A. Nicolaou, E. Antonakos, S. Zafeiriou, Mnemonic descent method:
a recurrent process applied for end-to-end face alignment, in CVPR, pp. 4177–4187 (2016)
46. X. Zhu, Z. Lei, X. Liu, H. Shi, S. Z. Li, Face alignment across large poses: a 3d solution, in
CVPR, pp. 146–155 (2016)
47. A. Jourabloo, X. Liu, Large-pose face alignment via cnn-based dense 3d model fitting, in
CVPR, pp. 4188–4196 (2016)
48. F. Liu, D. Zeng, Q. Zhao, X. Liu, Joint face alignment and 3d face reconstruction, in ECCV
(Springer, 2016), pp. 545–560
49. A. Bulat, G. Tzimiropoulos, How far are we from solving the 2d & 3d face alignment problem?
(and a dataset of 230,000 3d facial landmarks), in CVPR, vol. 1, no. 2, p. 4 (2017)
50. Y. Feng, F. Wu, X. Shao, Y. Wang, X. Zhou, Joint 3d face reconstruction and dense alignment
with position map regression network, in ECCV (2018)
51. X. Dong, S.-I. Yu, X. Weng, S.-E. Wei, Y. Yang, Y. Sheikh, Supervision-by-registration: an
unsupervised approach to improve the precision of facial landmark detectors, in CVPR, pp.
360–368 (2018)
52. Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, H. Lee, Unsupervised discovery of object landmarks
as structural representations, in CVPR (2018)
53. X. Dong, Y. Yan, W. Ouyang, Y. Yang, Style aggregated network for facial landmark detection,
in CVPR, vol. 2, p. 6 (2018)
54. S. Honari, P. Molchanov, S. Tyree, P. Vincent, C. Pal, J. Kautz, Improving landmark localization
with semi-supervised learning, in CVPR (2018)
55. N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) (2005)
56. A. Ess, B. Leibe, L. Van Gool, Depth and appearance for mobile scene analysis, in IEEE
International Conference on Computer Vision (ICCV) (2007)
57. M. Enzweiler, D.M. Gavrila, Monocular pedestrian detection: Survey and experiments. IEEE
Trans. Pattern Anal. Mach. Intell. 12, 2179–2195 (2008)
58. C. Wojek, S. Walk, B. Schiele, Multi-cue onboard pedestrian detection (2009)
59. A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? the kitti vision benchmark
suite, in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (IEEE,
2012), pp. 3354–3361
60. P. Dollár, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: an evaluation of the state
of the art (2012)
61. S. Maji, A.C. Berg, J. Malik, Classification using intersection kernel support vector machines
is efficient, in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference
on, pp. 1–8. IEEE (2008)
62. J. Marin, D. Vázquez, A.M. López, J. Amores, B. Leibe, Random forests of local experts for
pedestrian detection, in Proceedings of the IEEE International Conference on Computer Vision,
pp. 2592–2599 (2013)
63. P. Dollár, Z. Tu, P. Perona, S. Belongie, Integral channel features, in British Machine Vision
Conference (BMVC) (2009)
64. R. Benenson, M. Mathias, T. Tuytelaars, L. Van Gool, Seeking the strongest rigid detector,
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.
3666–3673 (2013)
65. P. Dollár, R. Appel, S. Belongie, P. Perona, Fast feature pyramids for object detection (2014)
66. J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of
boosting. The Annals of Statistics (2000)
67. W. Nam, P. Dollár, J.H. Han, Local decorrelation for improved pedestrian detection, in Advances
in Neural Information Processing Systems, pp. 424–432 (2014)
68. S. Paisitkriangkrai, C. Shen, A. Van Den Hengel, Strengthening the effectiveness of pedestrian
detection with spatially pooled features, in European Conference on Computer Vision (Springer,
2014), pp. 546–561
69. S. Zhang, R. Benenson, B. Schiele, et al., Filtered channel features for pedestrian detection, in
CVPR, volume 1, p. 4 (2015)
70. P. Felzenszwalb, D. McAllester, D. Ramanan. A discriminatively trained, multiscale,
deformable part model, in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE
Conference on (IEEE, 2008), pp. 1–8
71. D. Park, D. Ramanan, C. Fowlkes, Multiresolution models for object detection, in European
Conference on Computer Vision (Springer, 2010), pp. 241–254
72. W. Ouyang, X. Wang, Single-pedestrian detection aided by multi-pedestrian detection, in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3198–3205
(2013)
73. J. Yan, X. Zhang, Z. Lei, S. Liao, S.Z. Li, Robust multi-resolution pedestrian detection in traffic
scenes, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 3033–3040 (2013)
74. X. Wang, W. Ouyang, A discriminative deep model for pedestrian detection with occlusion
handling, in 2012 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2012),
pp. 3258–3265
75. W. Ouyang, X. Zeng, X. Wang, Modeling mutual visibility relationship in pedestrian detection,
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3222–
3229 (2013)
76. P. Sermanet, K. Kavukcuoglu, S. Chintala, Y. LeCun, Pedestrian detection with unsupervised
multi-stage feature learning, in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 3626–3633 (2013)
77. W. Ouyang, X. Wang, Joint deep learning for pedestrian detection, in Proceedings of the IEEE
International Conference on Computer Vision, pp. 2056–2063 (2013)
78. X. Zeng, W. Ouyang, X. Wang, Multi-stage contextual deep learning for pedestrian detection,
in Proceedings of the IEEE International Conference on Computer Vision, pp. 121–128 (2013)
79. P. Luo, Y. Tian, X. Wang, X. Tang, Switchable deep network for pedestrian detection, in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 899–
906 (2014)
80. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object
detection and semantic segmentation, in Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 580–587 (2014)
81. J. Hosang, M. Omran, R. Benenson, B. Schiele, Taking a deeper look at pedestrians, in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4073–4082
(2015)
82. A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images (Technical
report, Citeseer, 2009)
83. A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional
neural networks, in Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
84. X. Wang, Y. Tian, P. Luo, X. Tang, Pedestrian detection aided by deep learning semantic tasks,
in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
85. X. Wang, Y. Tian, P. Luo, X. Tang, Deep learning strong parts for pedestrian detection, in
IEEE International Conference on Computer Vision (ICCV) (2015)
86. J. Li, X. Liang, S. Shen, T. Xu, J. Feng, S. Yan, Scale-
aware fast r-cnn for pedestrian detection. IEEE Transactions on Multimedia 20(4), 985–996
(2018)
87. M. Saberian, Z. Cai, N. Vasconcelos, Learning complexity-aware cascades for deep pedestrian
detection, in IEEE International Conference on Computer Vision (ICCV) (2015)
88. B. Yang, J. Yan, Z. Lei, S.Z. Li, Convolutional channel features, in ICCV, pp. 82–90 (2015)
Part II
Localizing Persons in Images

Finding people in images/videos is one of the fundamental problems of computer
vision and has been widely studied over recent decades. It is an important step toward
many subsequent applications such as face recognition, human pose estimation, and
smart surveillance. In this part, we introduce two specific studies of finding people
in images, i.e., facial landmark localization and pedestrian detection.
With recent advances in deep learning techniques and large-scale annotated
image datasets, deep convolutional neural network models have achieved significant
progress in salient object detection [1], crowd analysis [2, 3], and facial landmark
localization [4]. Facial landmark localization is typically formulated as a regression
problem. Among the existing methods that follow this approach, cascaded deep con-
volutional neural networks [5, 6] have emerged as one of the leading methods because
of their superior accuracy. Nevertheless, the three-level cascaded CNN framework
is complicated and unwieldy. It is arduous to jointly handle the classification (i.e.,
whether a landmark exists) and localization problems for unconstrained settings.
Long et al. [7] recently proposed an FCN for pixel labeling, which takes an input
image with an arbitrary size and produces a dense label map with the same res-
olution. This approach shows convincing results for semantic image segmentation
and is also very efficient because convolutions are shared among overlapping image
patches. Notably, classification and localization can be simultaneously achieved with
a dense label map. The success of this work inspires us to adopt an FCN in our task,
i.e., pixelwise facial landmark prediction. Nevertheless, a specialized architecture
is required because our task requires more accurate prediction than generic image
labeling.
Pedestrian detection is an essential task for an intelligent video surveillance sys-
tem. It has also been an active research area in computer vision in recent years.
Many pedestrian detectors, such as [8, 9], have been proposed based on handcrafted
features. With the great success achieved by deep models in many tasks of com-
puter vision, hybrid methods that combine traditional, handcrafted features [8, 9]
and deep convolutional features [10, 11] have become popular. For example, in [12],
a stand-alone pedestrian detector (which uses squares channel features) is adopted as
a highly selective proposer (<3 regions per image), followed by R-CNN [13] for clas-
sification. Thus, in this part, we will also discuss these types of pedestrian detection
methods.

References

1. T. Chen, L. Lin, L. Liu, X. Luo, X. Li, Disc: deep image saliency computing via
progressive representation learning. TNNLS 27(6), 1135–1149 (2016)
2. L. Liu, H. Wang, G. Li, W. Ouyang, L. Lin, Crowd counting using deep recurrent
spatial-aware network, in IJCAI (2018)
3. L. Liu, R. Zhang, J. Peng, G. Li, B. Du, L. Lin, Attentive crowd flow machines,
in ACM MM (ACM, 2018), pp. 1553–1561
4. Z. Zhang, P. Luo, C.C. Loy, X. Tang, Facial landmark detection by deep multi-task
learning, in ECCV (Springer, 2014), pp. 94–108
5. Y. Sun, X. Wang, X. Tang, Deep convolutional network cascade for facial point
detection, in CVPR, pp. 3476–3483 (2013)
6. R. Weng, J. Lu, Y.-P. Tan, J. Zhou, Learning cascaded deep auto-encoder networks
for face alignment, TMM, vol. 18, no. 10, pp. 2066–2078 (2016)
7. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic
segmentation, in CVPR, pp. 3431–3440 (2015)
8. P. Dollár, Z. Tu, P. Perona, S. Belongie, Integral channel features, in British
Machine Vision Conference (BMVC) (2009)
9. P. Dollár, R. Appel, S. Belongie, P. Perona, Fast feature pyramids for object
detection (2014)
10. A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep
convolutional neural networks, in Advances in Neural Information Processing
Systems, pp. 1097–1105 (2012)
11. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-
scale image recognition, arXiv:1409.1556 (2014)
12. J. Hosang, M. Omran, R. Benenson, B. Schiele, Taking a deeper look at pedes-
trians, in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 4073–4082 (2015)
13. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accu-
rate object detection and semantic segmentation, in Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 580–587 (2014)
Chapter 3
Face Localization and Enhancement

Abstract Facial landmark localization plays a critical role in facial recognition and
analysis. In this chapter, we first propose a novel cascaded backbone-branches fully
convolutional neural network (BB-FCN) for rapidly and accurately localizing facial
landmarks in unconstrained and cluttered settings. The proposed BB-FCN generates
facial landmark response maps directly from raw images without any preprocessing.
It follows a coarse-to-fine cascaded pipeline, which consists of a backbone network
for roughly detecting the locations of all facial landmarks and one branch network
for each type of detected landmark to further refine their locations (© [2019] IEEE.
Reprinted, with permission, from [1]). At the end of this chapter, we also introduce
the progress in face hallucination, a fundamental problem in the face analysis field
that refers to generating a high-resolution facial image from a low-resolution input
image (© [2019] IEEE. Reprinted, with permission, from [2]).

3.1 Facial Landmark Machines

Facial landmark localization aims to automatically predict the key point positions
in facial image regions. This task is an essential component in many face-related
applications, such as facial attribute analysis [3], facial verification [4, 5], and facial
recognition [6–8]. Although tremendous effort has been devoted to this topic, the
performance of facial landmark localization is still far from perfect, particularly in
facial regions with severe occlusions or extreme head poses.
Most of the existing approaches to facial landmark localization have been devel-
oped for a controlled setting, e.g., the facial regions are detected in a preprocessing
step. This setting has drawbacks when working with images taken in the wild (e.g.,
cluttered surveillance scenes), where automated face detection is not always reliable.
The objective of this work is to propose an effective and efficient facial landmark
localization method that is capable of handling images taken in unconstrained set-
tings that contain multiple faces, extreme head poses, and occlusions (see Fig. 3.1).
Specifically, we focus on the following issues when developing our algorithm.
Fig. 3.1 Facial landmark localization in unconstrained settings. First row: Two cluttered images
with an unknown number of faces; second row: Dense response maps generated by our method

• Faces may have great variations in appearance and structure in unconstrained set-
tings due to diverse viewing conditions, rich facial expressions, large pose changes,
facial accessories (e.g., glasses and hats), and aging. Therefore, traditional global
models may not work well because the usual assumptions (e.g., certain spatial
layouts) may not hold in such environments.
• Boosted-cascade-based fast face detectors, which evolved from the seminal work
of Viola and Jones [9], can work well only for near-frontal faces under normal
conditions. Although accurate deformable part-based models [10] can perform
much better on challenging datasets, these models are slow due to their high
complexity. Detection in an image takes a few seconds, which makes such detectors
impractical for our task.
In this section, we formulate facial landmark localization as a pixel-labeling prob-
lem and devise a fully convolutional neural network (FCN) to overcome the afore-
mentioned issues. The proposed approach produces facial landmark response maps
directly from raw images without relying on any preprocessing or feature engineer-
ing. Two typical landmark response maps generated with our method are shown in
Fig. 3.1.
Considering both computational efficiency and localization accuracy, we pose
facial landmark localization as a cascaded filtering process. In particular, the locations
of facial landmarks are first roughly detected in a global context and then refined
by observing local regions. To this end, we introduce a novel FCN architecture that
naturally follows this coarse-to-fine pipeline. Specifically, our architecture contains
one backbone network and several branches, with each branch corresponding to
one landmark type. For computational efficiency, the backbone network is designed
to be an FCN with lightweight filters, which takes a low-resolution image as its
input and rapidly generates an initial multichannel heat map, with each channel
predicting the location of a specific landmark. We can obtain landmark proposals
from each channel of the initial heat map. We can then crop a region centered at
every landmark proposal from both the original input image and the corresponding
channel of the response map. These cropped regions are stacked and fed to a branch
network for fine and accurate localization. Because fully connected layers are not
used in either network, we call our architecture a cascaded backbone-branches fully
convolutional network (BB-FCN). Due to the tailored architecture of the backbone
network, which can reject most background regions and retain high-quality landmark
proposals, the BB-FCN is also capable of accurately localizing the landmarks of faces
on various scales by rapidly scanning every level of the constructed image pyramid.
Furthermore, we have discovered that our landmark localization results can help
generate fewer and higher quality face proposals, thus enhancing the accuracy and
efficiency of face detection.

3.2 The Cascaded BB-FCN Architecture

Given an unconstrained image I with an unknown number of faces, our facial land-
mark localization method aims to locate all facial landmarks in the image. We use
L_i^k = (x_i^k, y_i^k) to denote the location of the ith landmark of type k in image I, where
x_i^k and y_i^k represent the coordinates of this landmark. Then, our task is to obtain the
complete set of landmarks in I,

\mathrm{Det}(I) = \{(x_i^k, y_i^k)\}_{i,k},    (3.1)

where k = 1, 2, ..., K . When describing our method and analyzing the proposed
network, we set K = 5 as an example, but our method is also applicable to any other
values of K . Here, the five landmark types are the left eye (LE), right eye (RE), nose
(N), left mouth corner (LM), and right mouth corner (RM).
In contrast to existing approaches that predict landmark locations by coordinate
regression, we exploit fully convolutional neural networks (FCNs) to directly produce
response maps that indicate the probability of landmark existence at every image
location. In our method, the predicted value at each location of the response map
can be viewed as a series of filtering operations applied to a specific region of the
input image. The specific region is called the receptive field. An ideal series of filters
should have the following property: a receptive field with a landmark of a specific
type located at its center should return a strong response value, while receptive fields
without that type of landmark in the center should yield weak responses. Let F_{W_k}(P)
denote the result of applying a series of filtering functions with parameter setting W_k
for type-k landmarks to receptive field P, and it is defined as follows:

F_{W_k}(P) = \begin{cases} 1 & \text{if } P \text{ has a type-}k\text{ landmark in the center,} \\ 0 & \text{otherwise.} \end{cases}    (3.2)

Applying this function in a sliding window manner to w × h overlapping receptive
fields in an input image I generates a response map F_{W_k} * I of size w × h, whose
value at location (x, y) can thus be defined as

(F_{W_k} * I)(x, y) = F_{W_k}(I(P(x, y))),    (3.3)

where I (P(x, y)) denotes the image patch corresponding to the receptive field of
location (x,y) in the output response map. If the response value is larger than a
threshold θ, a landmark of type k is detected at the center of the patch in image I .
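As a concrete reading of Eqs. (3.2) and (3.3), the snippet below turns a K-channel response map into landmark detections by thresholding and keeping local maxima. The threshold value and the local-maximum window size are assumptions, and the snippet is illustrative rather than the authors' post-processing code.

import numpy as np
from scipy.ndimage import maximum_filter

def detect_landmarks(response, theta=0.5, window=5):
    # response: array of shape (K, h, w), one channel per landmark type.
    # Returns a list of (k, x, y, score) detections.
    detections = []
    for k in range(response.shape[0]):
        r = response[k]
        # keep positions that are above the threshold and are local maxima
        peaks = (r > theta) & (r == maximum_filter(r, size=window))
        ys, xs = np.nonzero(peaks)
        detections += [(k, int(x), int(y), float(r[y, x])) for x, y in zip(xs, ys)]
    return detections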
According to Eq. (3.3), there is a trade-off between localization accuracy and
computational cost. To achieve high accuracy, we need to compute response values
for significantly overlapping receptive fields. However, to accelerate the detection
process, we should generate a coarser response map on receptive fields with less
overlap or from a lower resolution image. This motivates us to develop a cascaded
coarse-to-fine process to localize landmarks progressively, in a spirit similar to that
of the hierarchical deep networks for image classification in [11]. More specifically,
our network consists of two components. The first component generates a coarse
response map from a relatively low-resolution input, identifying rough landmark
locations. Then, the second component takes local patches centered at every estimated
landmark location and applies another filtering process to the local patches to obtain
a fine response map for accurate landmark localization.
In this section, this two-component architecture is implemented as a backbone-
branches fully convolutional neural network in which the backbone network gen-
erates coarse response maps for rough location inference, and the branch networks
produce fine response maps for accurate location refinement. Figure 3.2 shows the
architecture of our network.
Let a convolutional layer be denoted as C(n, h × w × ch) and a deconvolutional
layer be denoted as D(n, h × w × ch), where n represents the number of kernels and
h,w,ch represent the height, width, and number of channels of a kernel, respectively.
We also use MP to denote a max-pooling layer. In our network, the stride of all
convolutional layers is 1, and the stride of all deconvolutional layers is 2. The size
of the max-pooling operator is set to 2 × 2, and the stride is 2.

3.2.1 Backbone Network

(Figure 3.2 layer dimensions, from the original diagram: backbone 3 × 32 × 32 input → 5 × 32 × 32
response maps; branches 4 × 24 × 24 guided crops → 1 × 24 × 24 refined maps.)
Fig. 3.2 The main architecture of the proposed backbone-branches fully convolutional neural
network. This approach is capable of producing pixelwise facial landmark response maps in a
progressive manner. The backbone network first generates low-resolution response maps that iden-
tify approximate landmark locations via a fully convolutional network. The branch networks then
produce fine response maps over local regions for more accurate landmark localization. There are
K (e.g., K = 5) branches, each of which corresponds to one type of facial landmark and refines
the related response map. Only downsampling, upsampling, and prediction layers are shown, and
intermediate convolutional layers are omitted in the network branches

The backbone network is a fully convolutional network. It efficiently generates an ini-
tial low-resolution response map for input image I. When localizing facial landmarks
in an image taken in an unconstrained setting, it can effectively reject a majority of
background regions with a threshold. Let W_c denote the parameters and H^k(I; W_c)
denote the predicted heat map of image I for the kth type of landmarks. The value of
H^k(I; W_c) at position (x, y) can be computed with Eq. (3.3). We train the backbone
FCN using the following loss function:

L_1(I; W_c) = \sum_{k=1}^{K} \| H^k(I; W_c) - H_c^k(I) \|^2,    (3.4)

where H_c^k(I) is the ground truth map for type-k landmarks.
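In a modern framework, the loss in Eq. (3.4) is a summed squared error over the K heat-map channels. A minimal PyTorch rendering (with a batch dimension and batch averaging added for illustration) might look as follows.

import torch

def backbone_loss(pred, target):
    # pred, target: tensors of shape (batch, K, h, w) holding the predicted and
    # ground truth response maps; Eq. (3.4) sums the squared error over the K
    # channels and all positions, and here we additionally average over the batch.
    return ((pred - target) ** 2).sum(dim=(1, 2, 3)).mean()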

3.2.2 Branch Network

The branch network is composed of K branches, each responsible for detecting


one type of landmark. All the K branches are designed to share the same network
structure. Take one branch as an example. Cropped patches of the original input image
and regions from the backbone output heat map are stacked as its input. The input
data, therefore, consist of four channels, including 3 channels from the original RGB
image and 1 channel from the corresponding channel of the backbone output heat
map. To make the branch network better suited for landmark position refinement, we
resize the original input image to 64 × 64, four times the size of the backbone input,
and at the same time zoom the heat map from the backbone network to 64 × 64.
The resolution of all the cropped patches is 24 × 24, and they are all centered at the
landmark position predicted by the backbone network. As shown in Fig. 3.2, each
branch is trained in the same way as the backbone network. We denote the parameters
of the branch component for type-k landmarks as Wkf and use H (P; Wkf ), H0k (P)
to denote the heat map that it generates and the corresponding ground truth heat map
of patch P, respectively. The loss function of this branch component is again defined
as follows:
L2 (P; Wkf ) = ||H (P; Wkf ) − H0k (P)||2 . (3.5)

Each branch component is composed of 5 convolutional layers with no pooling
operations. The dimensions of the input data of each branch are 24 × 24 × 4. The
first 4 convolutional layers each have 5 kernels of size 5 × 5 with a stride of 1, while
the last convolutional layer has a single 1 × 1 kernel (over the 5 input channels) with
stride 1, producing the one-channel fine response map. As shown in Fig. 3.2, each
branch FCN component is detailed as follows:
C(5, 5 × 5 × 5) - C(5, 5 × 5 × 5) - C(5, 5 × 5 × 5) - C(5, 5 × 5 × 5) - C(1, 1 × 1 × 5).
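Under the C(n, h × w × ch) convention above, one branch can be written compactly in PyTorch as below. The same-padding used to keep the 24 × 24 resolution and the ReLU nonlinearities between layers are assumptions that the chapter does not state; the first layer takes the 4-channel stacked input.

import torch.nn as nn

class BranchFCN(nn.Module):
    # One branch: input 24x24x4 (RGB patch + one backbone heat map channel),
    # output a one-channel 24x24 fine response map.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 5, kernel_size=5, stride=1, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(5, 5, kernel_size=5, stride=1, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(5, 5, kernel_size=5, stride=1, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(5, 5, kernel_size=5, stride=1, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(5, 1, kernel_size=1, stride=1),
        )

    def forward(self, x):   # x: (batch, 4, 24, 24)
        return self.net(x)  # (batch, 1, 24, 24)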

3.2.3 Ground Truth Heat Map Generation

To our knowledge, the ground truth of a facial landmark is traditionally given as a


single pixel location (x, y). To adapt such landmark specifications for the training
stage of our proposed BB-FCN network, we generate the ground truth heat map
of an input image according to the annotated facial landmark locations. The most
straightforward method assigns “1” to a single pixel corresponding to each landmark
location and “0” to the remaining pixels. However, we argue that this method is
suboptimal because an isolated point cannot reflect discrepancies among multiple
annotations. As shown in Fig. 3.3a, the right mouth corner has three slightly different
locations marked by three annotators. To account for such discrepancies, we label
each landmark as a small region rather than an isolated point. We initialize the heat
map using zero everywhere, and then for each landmark p, we mark a circular region
with center p and radius R in the ground truth heat map with 1. Different radii are
adopted for the backbone network and the branch networks, denoted as R_c and R_f,
respectively. R_f is set to be smaller than R_c because the backbone network estimates
coarse landmark positions, while the branch networks predict accurate landmark
locations.

Fig. 3.3 a An isolated point cannot accurately reflect discrepancies among multiple annotations.
The three points near the right mouth corner were annotated by three different workers. b We label
a landmark as a small circular region rather than an isolated point in the ground truth heat map
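The circular-region labeling above translates directly into code. The sketch below is a plain transcription of the rule rather than the authors' implementation; the radius is passed in (R_c for the backbone map, R_f for the branch maps).

import numpy as np

def ground_truth_heatmap(landmarks, height, width, radius):
    # landmarks: list of (x, y) positions of one landmark type in the map's
    # coordinate frame. Returns a (height, width) map that is 1 inside a circle
    # of the given radius around each landmark and 0 elsewhere.
    heat = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for (px, py) in landmarks:
        heat[(xs - px) ** 2 + (ys - py) ** 2 <= radius ** 2] = 1.0
    return heat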

3.3 Experimental Results

3.3.1 Datasets

To train our proposed BB-FCN, we collect 7317 facial images (6317 for training, 1000
for validation) from the Internet and collect 7542 natural images (6542 for training,
1000 for validation) with no faces from Pascal-VOC2012 as negative samples. Each
face is annotated with 72 landmarks. We use two challenging public datasets for
evaluation: AFW [10] and AFLW [12]. There is no overlap among the training,
validation, and evaluation datasets.
AFW: This dataset contains 205 images (468 faces) collected in the wild. Invisible
landmarks are not annotated, and each face is annotated with at most 6 landmarks.
This dataset is intended for use in testing facial keypoint detection in unconstrained
settings, meaning faces may exhibit large variations in pose, expression, and illumi-
nation and may have severe occlusions.
AFLW: This dataset contains 21,080 faces with large pose variations. It is highly
suitable for evaluating the performance of face alignment across a large range of
poses. The selection of testing images from AFLW follows [13], which randomly
chooses 3000 faces, 39% of which are nonfrontal.

3.3.2 Evaluation Metric

To evaluate the accuracy of facial landmark localization, we adopt the mean (position)
error as the metric. For a specific type of landmark, the mean error is calculated as
the mean distance between the detected landmarks of the given type in all testing
images and their corresponding ground truth positions, normalized with respect to the
interocular distance. The (position) error of a single landmark is defined as follows:

err = \frac{\sqrt{(x - x')^2 + (y - y')^2}}{l} \times 100\%,    (3.6)

where (x, y) and (x', y') are the ground truth and detected landmark locations, respec-
tively, and the interocular distance l is the Euclidean distance between the center
points of the two eyes. In our experiments, we evaluate the mean error of every type
of facial landmark as well as the average mean error over all landmark types, i.e.,
LE (left eye), RE (right eye), N (nose), LM (left mouth corner), RM (right mouth
corner), and A (average mean error of the five facial landmarks).
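Equation (3.6) and the averaging over a test set translate into a few lines of NumPy; the array shapes below are assumptions made for illustration.

import numpy as np

def mean_error(pred, gt, left_eye, right_eye):
    # pred, gt: (N, 2) arrays of detected and ground truth (x, y) positions for
    # one landmark type over N test faces; left_eye, right_eye: (N, 2) arrays of
    # eye centers used for the interocular normalization.
    interocular = np.linalg.norm(left_eye - right_eye, axis=1)
    err = np.linalg.norm(pred - gt, axis=1) / interocular * 100.0
    return err.mean()   # mean position error, in percent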

3.3.3 Performance Evaluation for Unconstrained Settings

The BB-FCN is capable of dealing with facial images taken in unconstrained settings,
i.e., where the locations of facial regions and the number of faces are unknown. We eval-
uate the performance of the BB-FCN using recall–error curves. A predicted facial
landmark is considered correct if there exists a ground truth landmark of the same
type within the given position error. For a fixed number of predicted landmarks, the
recall rate (the fraction of ground truth annotations covered by predicted landmarks)
varies as the acceptable position error increases; thus, a recall–error curve can be
obtained.
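The recall–error curve can be traced by sweeping the acceptable position error. The sketch below assumes that matching between predictions and ground truth has already been performed and that each covered ground truth landmark carries the normalized error of its closest prediction.

import numpy as np

def recall_error_curve(matched_errors, num_ground_truth, max_err=25.0, steps=100):
    # matched_errors: normalized position errors (in percent of the interocular
    # distance) of the closest prediction for each covered ground truth landmark.
    thresholds = np.linspace(0.0, max_err, steps)
    matched_errors = np.asarray(matched_errors)
    recalls = np.array([(matched_errors <= t).sum() / float(num_ground_truth)
                        for t in thresholds])
    return thresholds, recalls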
We evaluate the performance of the BB-FCN and the regression-based deep model
on the AFW dataset using an unconstrained setting. For faces with one or both eyes
invisible, the interocular distances are set at 41.9% of the length of their annotated
bounding boxes. The BB-FCN significantly outperforms the regression network, and
the complete BB-FCN model performs much better than the backbone network alone.
With a prediction of 15 landmarks for each landmark type, the complete model recalls
45% more landmarks than the regression network when the acceptable position error
is set within 8% of the interocular distance. As the number of landmark predictions of
each type increases to 30, the recall rates of the five landmark types within a position error of 25%
of the interocular distance are 94.1, 95.7, 91.5, 95.8, and 95.2%, respectively. Given more predicted
landmarks, we can achieve even higher landmark recall.
some landmark detection results on the AFW dataset in unconstrained settings.

3.3.4 Comparison with the State of the Art

We compare our method with other state-of-the-art methods, i.e., (1) robust cas-
caded pose regression (RCPR) [14]; (2) a tree structured part model (TSPM) [10];
(3) Luxand face SDK; (4) explicit shape regression (ESR) [15]; (5) a cascaded
deformable shape model (CDM) [16]; (6) the supervised descent method (SDM)
[17]; (7) a task-constrained deep convolutional network (TCDCN) [13]; (8) multi-
task cascaded convolutional networks (MTCNN) [18]; and (9) recurrent attentive-
refinement networks (RAR) [19]. The results of some competing methods are quoted
from [13].

Fig. 3.4 Qualitative facial landmark detection results in unconstrained settings. The BB-FCN is
capable of dealing with unconstrained facial images, even though the location of facial regions and
the number of faces in the image are unknown. Best viewed in color

Fig. 3.5 Qualitative facial landmark localization results by our method. The first row shows the
results on the AFW dataset, and the second row shows the results on the AFLW dataset. Our method
is robust in conditions of occlusion, exaggerated expressions, and extreme illumination

On the AFW dataset, our average mean error over the five landmark types is 6.18%,
which improves over the performance of the state-of-the-art TCDCN by 24.6%. On
the AFLW dataset, the BB-FCN model achieves a 6.28% average mean error, a 21.5%
improvement over TCDCN. The qualitative results in Fig. 3.5 show that our method is
robust in conditions of occlusion, exaggerated expressions, and extreme illumination.

3.4 Attention-Aware Face Hallucination

Face hallucination refers to generating a high-resolution facial image from a low-
resolution input image, which is a fundamental problem in the face analysis field; face
hallucination can facilitate several face-related tasks, such as face attribute recogni-
tion [20], face alignment [21], and facial recognition [22], in the complex real-world
scenarios in which facial images are often of very low quality.

Fig. 3.6 Sequentially discovering and enhancing facial parts in our attention-FH framework. At
each time step, our framework specifies an attended region based on past hallucination results and
enhances it by considering the global perspective of the whole face. The red solid bounding boxes
indicate the latest perceived patch in each step, and the blue dashed bounding boxes indicate all
the previously enhanced regions. We adopt a global reward at the end of the sequence to drive the
framework learning under the reinforcement learning paradigm

The existing face hallucination methods usually focus on how to learn a dis-
criminative patch-to-patch mapping from low-resolution (LR) images to high-resolution (HR) images. In particular,
substantial recent progress has been made by employing advanced convolutional
neural networks (CNNs) [23] and multiple cascaded CNNs [24]. The face structure
priors and spatial configurations [25, 26] are often treated as external information
for enhancing faces and facial parts. However, the contextual dependencies among
the facial parts are usually ignored during hallucination processing. According to
studies of the human perception process [27], humans start by perceiving whole
images and successively explore a sequence of regions with the attention shifting
mechanism rather than separately processing the local regions. This finding inspires
us to explore a new pipeline of face hallucination by sequentially searching for the
attentional local regions and considering their contextual dependency from a global
perspective.
Inspired by the recent successes of attention and recurrent models in a variety
of computer vision tasks [28–30], we propose an attention-aware face hallucination
(attention-FH) framework that recurrently discovers facial parts and enhances them
by fully exploiting the global interdependency of the image, as shown in Fig. 3.6.
In particular, accounting for the diverse characteristics of facial images in terms
of blurriness, pose, illumination, and facial appearance, we search for an optimal
accommodated enhancement route for each face hallucination. We resort to the deep
reinforcement learning (RL) method [31] to harness the model learning because this
technique has been demonstrated to be effective in globally optimizing sequential
models without supervision for every step.

Specifically, our attention-FH framework jointly optimizes a recurrent policy net-
work that learns the policies for selecting the preferable facial part in each step
and a local enhancement network for facial parts hallucination that considers the
previous enhancement results for the whole face. In this way, rich correlation cues
among different facial parts can be explicitly incorporated into each step of the local
enhancement process. For example, the agent can improve the enhancement of the
mouth region by considering a clearer version of the eye region, as Fig. 3.6 illustrates.
We define the global reward for reinforcement learning by the overall performance
of the super-resolved face, which drives the recurrent policy network optimization.
The recurrent policy network is optimized by following the reinforcement learning
(RL) procedure, which can be treated as a Markov decision process (MDP) maxi-
mized with a long-term global reward. In each time step, we learn the policies to
determine the optimal location of the next attended region by conditioning on the
current enhanced whole face and the history actions. One long short-term memory
(LSTM) layer is utilized to capture the past information of the attended facial parts.
The history actions are also memorized to avoid trapping the inference in a repetitive
action cycle.

3.4.1 The Framework of Attention-Aware Face Hallucination

Given a facial image Ilr with low resolution, our attention-FH framework targets the
corresponding high-resolution facial image Ihr by learning a projection function F:

I_{hr} = F(I_{lr} \mid \theta),    (3.7)

where θ denotes the function parameters. Our attention-FH framework is intended
to sequentially locate and enhance the attended facial parts in each step, which can
be formulated as a deep reinforcement learning procedure. Our framework consists
of two networks: the recurrent policy network, which dynamically determines the
specific facial part to be enhanced in the current step, and the local enhancement
network, which is employed to further enhance the selected facial part.
Specifically, the whole hallucination procedure of our attention-FH framework
can be formulated as follows. Given the input image I_{t-1} at the t-th step, the agent
of the recurrent policy network selects one local facial part \hat{I}_{t-1}^{l_t} with the location l_t:

l_t = f_\pi(s_{t-1}; \theta_\pi), \qquad \hat{I}_{t-1}^{l_t} = g(l_t, I_{t-1}),    (3.8)

where f π represents the recurrent policy network and θπ denotes its parameters.
st−1 is the encoded input state of the recurrent policy network, which is constructed
by the input image It−1 and the encoded history action h t−1 . g denotes a cropping
operation that crops a fixed-size patch from It−1 at location lt as the selected facial
part. The patch size is set as 60 × 45 for all facial images.
We then enhance each local facial part \hat{I}_{t-1}^{l_t} using our local enhancement network
f_e. The resulting enhanced local patch \hat{I}_t^{l_t} is computed as

\hat{I}_t^{l_t} = f_e(\hat{I}_{t-1}^{l_t}, I_{t-1}; \theta_e),    (3.9)

where θ_e denotes the local enhancement network parameters. The output image I_t at the
t-th step is therefore obtained by replacing the local patch of the input image I_{t-1} at
location l_t with the enhanced patch \hat{I}_t^{l_t}. Our whole sequential attention-FH procedure
can be written as

\begin{cases} I_0 = I_{lr} \\ I_t = f(I_{t-1}; \theta), & 1 \le t \le T \\ I_{hr} = I_T \end{cases}    (3.10)

where T is the maximal number of local patch mining steps, θ = [θ_π; θ_e], and f =
[f_π; f_e]. We set T = 25 empirically throughout this chapter.
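Read procedurally, Eqs. (3.8)–(3.10) amount to the loop sketched below. The policy network and enhancement network are treated as callables with assumed interfaces (the policy returns a location-probability map and its updated LSTM state), l_t is treated as the patch's top-left corner for simplicity, and the crop/paste helpers are illustrative rather than part of the original implementation.

import torch

def crop(img, top_left, hw):
    # img: (C, H, W) tensor; return the hw patch whose top-left corner is top_left
    y, x = top_left
    return img[:, y:y + hw[0], x:x + hw[1]]

def paste(img, patch, top_left):
    # write the enhanced patch back into a copy of the current hallucination result
    y, x = top_left
    out = img.clone()
    out[:, y:y + patch.shape[1], x:x + patch.shape[2]] = patch
    return out

def attention_fh(I_lr, policy_net, enhance_net, T=25, patch_hw=(60, 45)):
    # Sequentially attend to and enhance T local patches (Eqs. 3.8-3.10).
    I = I_lr.clone()          # I_0 = I_lr (already upsampled to the target size)
    hidden = None             # encoded action history (LSTM state)
    for _ in range(T):
        probs, hidden = policy_net(I, hidden)                 # (H, W) location probabilities
        loc = torch.multinomial(probs.flatten(), 1).item()    # sample l_t during training
        y, x = divmod(loc, probs.shape[-1])                   # argmax is used at test time
        patch = crop(I, (y, x), patch_hw)                     # g(l_t, I_{t-1}), Eq. (3.8)
        enhanced = enhance_net(patch.unsqueeze(0), I.unsqueeze(0)).squeeze(0)  # Eq. (3.9)
        I = paste(I, enhanced, (y, x))                        # render I_t
    return I                                                  # I_hr = I_T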

3.4.2 Recurrent Policy Network

The recurrent policy network performs sequential local patch mining, which can be
treated as a decision-making process at discrete time intervals. At each time step,
the agent acts to determine an optimal image patch to be enhanced by conditioning
on the current state that it has reached. Given the selected location, the extracted
local patch is enhanced through the proposed local enhancement network. During
each time step, the state is updated by rendering the hallucinated facial image with
the enhanced facial part. The policy network recurrently selects and enhances local
patches until the maximum time step is achieved. At the end of this sequence, a
delayed global reward, which is measured by the mean squared error between the
final face hallucination result and the ground truth high-resolution image, is employed
to guide the policy learning of the agent. The agent can thus iterate to explore an
optimal search route for each individual facial image to maximize the global holistic
reward.
State: The state st at the tth step should be able to provide enough information for
the agent to make a decision without looking back more than one step. It is, therefore,
composed of two parts: (1) the enhanced hallucinated facial image It from previous
steps, which enables the agent to sense rich contextual information for processing a
new patch, e.g., the part that is still blurred and requires enhancement, and (2) the
latent variable h t , which is obtained by forwarding the encoded history action vector
h t−1 to the LSTM layer and is used to incorporate all previous actions. Therefore,
the goal of the agent is to determine the location of the next attended local patch by
sequentially observing state st = {It , h t } to generate a high-resolution image Ihr .

Fig. 3.7 Network architecture of our recurrent policy network and local enhancement network.
At each time step, the recurrent policy network takes a current hallucination result It−1 and action
history vector encoded by LSTM (512 hidden states) as the input and then outputs the action prob-
abilities for all W × H locations, where W and H are the width and height of the input image,
respectively. The policy network first encodes the It−1 with one fully connected layer (256 neu-
rons) and then fuses the encoded image and the action vector with an LSTM layer. Finally, a fully
connected linear layer is appended to generate the W × H -way probabilities. Based on the proba-
bility map, we extract the local patch and then pass the patch and It−1 into the local enhancement
network to generate the enhanced patch. The local enhancement network is constructed by two fully
connected layers (each with 256 neurons) encoding It−1 and 8 cascaded convolutional layers for
image patch enhancement. Thus, a new face hallucination result can be generated by replacing the
local patch with an enhanced patch

Action: Given a facial image I with size W × H , the agent selects one action from
all possible locations lt = (x, y|1 ≤ x ≤ W, 1 ≤ y ≤ H ). As shown in Fig. 3.7, at
each time step t, the policy network f π first encodes the current hallucinated facial
image It−1 with a fully connected layer. Then, the LSTM unit in the policy network
fuses the encoded vector with the history action vector h t−1 . Ultimately, a final linear
layer is appended to produce a W × H -way vector, which indicates the probabilities
of all available actions P(l_t = (x, y)|s_{t−1}), with each entry (x, y) indicating the
probability that the next attended patch is located at position (x, y). The agent then
takes action lt by stochastically drawing an entry following the action probability
distribution. During testing, we select the location lt with the highest probability.
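To make this decision step concrete, the following is a minimal PyTorch sketch of such a recurrent policy head. The layer widths (a 256-neuron image encoding, a 512-state LSTM, and a W × H-way linear output) follow Fig. 3.7, but the class name, tensor shapes, and single-channel input are illustrative assumptions rather than a released implementation.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Sketch of the recurrent policy network f_pi (layer sizes follow Fig. 3.7)."""
    def __init__(self, img_h, img_w):
        super().__init__()
        self.encode = nn.Linear(img_h * img_w, 256)     # encode the current result I_{t-1}
        self.lstm = nn.LSTMCell(256, 512)               # fuse with the action history h_{t-1}
        self.to_probs = nn.Linear(512, img_h * img_w)   # W*H-way action probabilities

    def forward(self, img, state=None):
        h, c = self.lstm(torch.relu(self.encode(img.flatten(1))), state)
        return torch.softmax(self.to_probs(h), dim=1), (h, c)   # P(l_t | s_{t-1}), new state

policy = RecurrentPolicy(img_h=160, img_w=120)
probs, state = policy(torch.rand(1, 160, 120))   # first step: zero-initialized LSTM state
loc = torch.multinomial(probs, 1)                # stochastic draw during training
loc_greedy = probs.argmax(dim=1)                 # highest-probability location at test time
```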
Reward: The reward is applied to guide the agent to learn the sequential poli-
cies to obtain the entire action sequence. Because our model targets generating a
hallucinated facial image, we define the reward according to the mean squared error
(MSE) after enhancing T attended local patches at the selected locations with the
local enhancement network. Given the fixed local enhancement network f e , we first
compute the final face hallucination result IT by sequentially enhancing a list of
local patches mined by l = {l_1, l_2, ..., l_T}. The MSE loss is thus obtained by computing
L_{θ_π} = E_{p(l;π)}[‖I_hr − I_T‖²], where p(l; π) is the probability distribution produced
by the policy network f_π. The reward r_t at the t-th step can be set as

$$
r_t =
\begin{cases}
0, & t < T, \\
-L_{\theta_\pi}, & t = T.
\end{cases}
\tag{3.11}
$$

When the discount factor is set to 1, the total discounted reward is R = −L_{θ_π}.

3.4.3 Local Enhancement Network

The local enhancement network f_e is used to enhance the extracted low-resolution
patches. Its input contains the whole facial image I_{t−1} that is rendered by all previous
enhanced results and the selected local patch Î_{t−1}^{l_t} at the current step. As shown in
Fig. 3.7, we pass the input I_{t−1} into two fully connected layers to generate a feature
map that is the same size as the extracted patch Î_{t−1}^{l_t} to encode the holistic information
of I_{t−1}. This feature map is then concatenated with the extracted patch Î_{t−1}^{l_t} and goes
through convolution layers to obtain the enhanced patch Î_t^{l_t}.
We employ a cascaded convolution network architecture similar to that of gen-
eral image super-resolution methods [32]. No pooling layers are used between the
convolution layers, and the sizes of the feature maps remain fixed throughout all
the convolution layers. We follow the detailed setting of the network employed by
Tuzel et al. [33]. The two fully connected layers each contain 256 neurons. The cascaded
convolution network is composed of eight layers. The conv1 and conv7 layers have
16 channels of 3 × 3 kernels; the conv2 and conv6 layers have 32 channels of 7 × 7
kernels; the conv3, conv4, and conv5 layers have 64 channels of 7 × 7 kernels;
and the conv8 layer has a kernel size of 5 × 5 and outputs the enhanced image patch
with the same size and number of channels as the extracted patch.
In the initialization, we first upsample image Ilr to the same size as that of high-
resolution image Ihr with the bicubic method. Our network first generates a residual
map and then combines the input low-resolution patch with the residual map to
produce the final high-resolution patch. Learning from the residual map has been
verified to be more effective than directly learning from the original high-resolution
images [34].
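The layer specification above is concrete enough to sketch the enhancer. The PyTorch sketch below assumes single-channel 60 × 45 patches and a 160 × 120 whole image for brevity; the class name, the padding scheme, and the way the 256-d encoding is expanded to the patch's spatial size are our own interpolations, not the released implementation.

```python
import torch
import torch.nn as nn

class LocalEnhancer(nn.Module):
    """Sketch of the local enhancement network f_e described in this section."""
    def __init__(self, img_h=160, img_w=120, patch_h=60, patch_w=45, ch=1):
        super().__init__()
        # Two fully connected layers encode the whole image I_{t-1} into a single-channel
        # map with the same spatial size as the extracted patch.
        self.encode = nn.Sequential(
            nn.Linear(ch * img_h * img_w, 256), nn.ReLU(),
            nn.Linear(256, patch_h * patch_w), nn.ReLU())
        # Eight cascaded conv layers, no pooling; channel and kernel sizes follow the text.
        def conv(cin, cout, k):
            return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2), nn.ReLU())
        self.cnn = nn.Sequential(
            conv(ch + 1, 16, 3), conv(16, 32, 7), conv(32, 64, 7), conv(64, 64, 7),
            conv(64, 64, 7), conv(64, 32, 7), conv(32, 16, 3),
            nn.Conv2d(16, ch, 5, padding=2))     # conv8 predicts a residual map

    def forward(self, whole_img, lr_patch):
        b, _, ph, pw = lr_patch.shape
        context = self.encode(whole_img.flatten(1)).view(b, 1, ph, pw)
        residual = self.cnn(torch.cat([lr_patch, context], dim=1))
        return lr_patch + residual               # residual learning, as in [34]

enhancer = LocalEnhancer()
enhanced = enhancer(torch.rand(1, 1, 160, 120), torch.rand(1, 1, 60, 45))
```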

3.4.4 Deep Reinforcement Learning

Our attention-FH framework jointly trains the parameters θπ of the recurrent policy
network f π and parameters θe of the local enhancement network f e . We introduce a
reinforcement learning scheme to perform joint optimization.
First, we optimize the recurrent policy network with the REINFORCE algorithm
[35] guided by the reward given at the end of sequential enhancement. The local
enhancement network is optimized with mean squared error between the enhanced
patch and the corresponding patch from the ground truth high-resolution image.
This supervised loss is calculated at each time step and can be minimized based on
backpropagation.
Because we jointly train the recurrent policy network and local enhancement
network, the change of parameters in the local enhancement network will affect the
final face hallucination result, which in turn will cause a nonstationary objective for
the recurrent policy network. We further employ the variance reduction strategy, as
mentioned in [36], to reduce variance due to the moving rewards during the training
procedure.
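Putting the two networks together, one joint training iteration can be sketched as follows. This is a simplified REINFORCE update under explicit assumptions: extract_patch and paste_patch are hypothetical cropping/pasting helpers, the baseline is a scalar moving average of past rewards standing in for the variance-reduction strategy of [36], and intermediate images are detached so that the policy-gradient and supervised losses act on disjoint computation graphs.

```python
import torch
import torch.nn.functional as F

def train_step(policy, enhancer, I_lr, I_hr, opt_pi, opt_e, baseline, T=25):
    """One joint update: REINFORCE for the policy, per-step MSE for the enhancer."""
    img, state = I_lr.clone(), None
    log_probs, enhance_loss = [], 0.0
    for _ in range(T):
        probs, state = policy(img.detach(), state)          # action distribution over W*H
        loc = torch.multinomial(probs, 1)                    # sample a patch location
        log_probs.append(torch.log(probs.gather(1, loc) + 1e-8))
        patch = extract_patch(img, loc).detach()             # hypothetical cropping helper
        enhanced = enhancer(img.detach(), patch)
        enhance_loss = enhance_loss + F.mse_loss(enhanced, extract_patch(I_hr, loc))
        img = paste_patch(img, enhanced.detach(), loc)       # hypothetical pasting helper

    reward = -F.mse_loss(img, I_hr)                          # delayed global reward, Eq. (3.11)
    policy_loss = -(reward - baseline) * torch.cat(log_probs).sum()   # REINFORCE [35]

    opt_pi.zero_grad()
    policy_loss.backward()
    opt_pi.step()
    opt_e.zero_grad()
    enhance_loss.backward()
    opt_e.step()
    return 0.9 * baseline + 0.1 * float(reward)              # updated moving-average baseline
```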

3.4.5 Experiments

To analyze the performance of our recurrent attention-memorized enhancement model, we compare our method with other state-of-the-art methods, including generic image super-resolution and face hallucination approaches (Table 3.1).
Extensive experiments are conducted on the BioID [37] and LFW [38] datasets. The
BioID dataset contains 1521 facial images collected under laboratory-constrained
settings. We use 1028 images for training and 493 images for evaluation. The LFW
dataset contains 5749 identities and 13233 facial images taken in an unconstrained
environment, of which 9526 images are used for training, and the remaining 3707
images are used for evaluation.
This training/testing split follows the split provided by the LFW datasets. In our
experiment, we first align the images on the BioID dataset using the SDM method
[17] and then crop the center image patches to sizes of 160 × 120 as the facial
images to be processed. For the LFW dataset, we use aligned facial images provided
in the LFW-funneled dataset [39] and extract the centric 128 × 128 image patches
for processing.
We set the maximum time steps at T = 25 in our attention-FH model for both
datasets. The face patch size is H × W = 60 × 45 for all experiments. The network
is updated using ADAM gradient descent [42]. The learning rate and the momentum
term are set to 0.0002 and 0.5, respectively [43–47].

Table 3.1 Comparison between our method and others in terms of PSNR, SSIM, and FSIM evaluation metrics

Methods      | LFW-funneled 8×            | BioID 8×
             | PSNR   SSIM    FSIM        | PSNR   SSIM    FSIM
Bicubic      | 21.92  0.6712  0.7824      | 20.68  0.6434  0.7539
SFH [40]     | 22.12  0.6732  0.7832      | 20.31  0.6234  0.7238
BCCNN [23]   | 22.62  0.6801  0.7903      | 21.40  0.6504  0.7621
MZQ [41]     | 22.12  0.6771  0.7802      | 21.11  0.6481  0.7594
SRCNN [26]   | 23.92  0.6927  0.8314      | 22.34  0.6980  0.8274
VDSR [32]    | 24.12  0.7031  0.8391      | 24.31  0.7321  0.8465
GLN [33]     | 24.51  0.7109  0.8405      | 24.76  0.7421  0.8525
Our method   | 26.17  0.7604  0.8630      | 26.56  0.7864  0.8748

References

1. L. Liu, G. Li, Y. Xie, Y. Yu, Q. Wang, L. Lin, Facial landmark machines: a backbone-branches
architecture with progressive representation learning. IEEE Trans. Multimedia. https://doi.org/
10.1109/TMM.2019.2902096
2. Q. Cao, L. Lin, Y. Shi, X. Liang, G. Li, Attention-aware face hallucination via deep reinforce-
ment learning, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Honolulu, HI, pp. 1656–1664 (2017). https://doi.org/10.1109/CVPR.2017.180
3. P. Luo, X. Wang, X. Tang, A deep sum-product architecture for robust facial attributes analysis,
in ICCV, pp. 2864–2871 (2013)
4. C. Lu, X. Tang, Surpassing human-level face verification performance on lfw with gaussianface,
in AAAI (2015)
5. L. Liu, C. Xiong, H. Zhang, Z. Niu, M. Wang, S. Yan, Deep aging face verification with large
gaps. TMM 18(1), 64–75 (2016)
6. Z. Zhu, P. Luo, X. Wang, X. Tang, Deep learning identity-preserving face space, in ICCV, pp.
113–120 (2013)
7. C. Ding, D. Tao, Robust face recognition via multimodal deep face representation. TMM
17(11), 2049–2058 (2015)
8. Y. Li, L. Liu, L. Lin, Q. Wang, Face recognition by coarse-to-fine landmark regression with
application to atm surveillance, in CCCV (Springer, 2017), pp. 62–73
9. P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in CVPR,
vol. 1. IEEE, pp. I–511 (2001)
10. X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild,
in CVPR (IEEE, 2012), pp. 2879–2886
11. Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste, W. Di, Y. Yu, Hd-cnn: hierarchical
deep convolutional neural networks for large scale visual recognition, in ICCV, pp. 2740–2748
(2015)
12. M. Köstinger, P. Wohlhart, P.M. Roth, H. Bischof, Annotated facial landmarks in the wild: a
large-scale, real-world database for facial landmark localization, in ICCV Workshops (IEEE,
2011), pp. 2144–2151
13. Z. Zhang, P. Luo, C.C. Loy, X. Tang, Facial landmark detection by deep multi-task learning,
in ECCV (Springer, 2014), pp. 94–108
14. X. Burgos-Artizzu, P. Perona, P. Dollár, Robust face landmark estimation under occlusion, in
ICCV, pp. 1513–1520 (2013)
15. X. Cao, Y. Wei, F. Wen, J. Sun, Face alignment by explicit shape regression. IJCV 107(2),
177–190 (2014)
16. X. Yu, J. Huang, S. Zhang, W. Yan, D. Metaxas, Pose-free facial landmark fitting via optimized
part mixtures and cascaded deformable shape model, in ICCV, pp. 1944–1951 (2013)
17. X. Xiong, F. Torre, Supervised descent method and its applications to face alignment, in CVPR,
pp. 532–539 (2013)
18. K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded
convolutional networks. IEEE Signal Process. Lett. 23(10), 1499–1503 (2016)
19. S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, A. Kassim, Robust facial landmark detection via
recurrent attentive-refinement networks, in ECCV (Springer, 2016), pp. 57–72
20. Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in ICCV, pp.
3730–3738 (2015)
21. Z. Zhang, P. Luo, C.C. Loy, X. Tang, Learning deep representation for face alignment with
auxiliary attributes. IEEE Trans. Pattern Anal. Mach. Intell. 38(5), 918–930 (2016)
22. E. Zhou, Z. Cao, Q. Yin, Naive-deep face recognition: touching the limit of lfw benchmark or
not? arXiv preprint arXiv:1501.04690 (2015)
23. E. Zhou, H. Fan, Z. Cao, Y. Jiang, Q. Yin, Learning face hallucination in the wild, in AAAI,
pp. 3871–3877 (2015)
24. S. Zhu, S. Liu, C.C. Loy, X. Tang, Deep cascaded bi-network for face hallucination. arXiv
preprint arXiv:1607.05046 (2016)

25. C. Liu, H.-Y. Shum, W.T. Freeman, Face hallucination: theory and practice. Int. J. Comput.
Vis. 75(1), 115–134 (2007)
26. C. Dong, C.C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-
resolution, in ECCV, pp. 184–199 (2014)
27. J. Najemnik, W.S. Geisler, Optimal eye movement strategies in visual search. Nature 434(7031),
387–391 (2005)
28. Y. Sun, D. Liang, X. Wang, X. Tang, Deepid3: face recognition with very deep neural networks.
arXiv preprint arXiv:1502.00873 (2015)
29. J.C. Caicedo, S. Lazebnik, Active object localization with deep reinforcement learning, in
ICCV, pp. 2488–2496 (2015)
30. K. Gregor, I. Danihelka, A. Graves, D.J. Rezende, D. Wierstra, DRAW: a recurrent neural
network for image generation, in ICLR, pp. 1462–1471 (2015)
31. D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre et al., Mastering the game of go with
deep neural networks and tree search. Nature 529, 484–503 (2016)
32. J. Kim, J.K. Lee, K.M. Lee, Accurate image super-resolution using very deep convolutional
networks (2016)
33. O. Tuzel, Y. Taguchi, J.R. Hershey, Global-local face upsampling network. arXiv preprint
arXiv:1603.07235 (2016)
34. S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, L. Zhang, Convolutional sparse coding for image
super-resolution, in ICCV, pp. 1823–1831 (2015)
35. R.J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement
learning. Mach. Learn. 8(3), 229–256 (1992)
36. V. Mnih, N. Heess, A. Graves, K. kavukcuoglu, Recurrent models of visual attention, in NIPS,
pp. 2204–2212 (2014)
37. O. Jesorsky, K.J. Kirchberg, R. Frischholz, Robust face detection using the hausdorff distance,
in AVBPA, pp. 90–95 (2001)
38. G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: A database for
studying face recognition in unconstrained environments. Technical Report 07-49, University
of Massachusetts, Amherst, October 2007
39. G.B. Huang, V. Jain, E. Learned-Miller, Unsupervised joint alignment of complex images, in
ICCV (2007)
40. C.-Y. Yang, S. Liu, M.-H. Yang, Structured face hallucination, in CVPR, pp. 1099–1106 (2013)
41. X. Ma, J. Zhang, C. Qi, Hallucinating face by position-patch. Pattern Recogn. 43(6), 2224–2236
(2010)
42. D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, in ICLR (2015)
43. T. Chen, L. Lin, L. Liu, X. Luo, X. Li, Disc: deep image saliency computing via progressive
representation learning. TNNLS 27(6), 1135–1149 (2016)
44. L. Liu, H. Wang, G. Li, W. Ouyang, L. Lin, Crowd counting using deep recurrent spatial-aware
network, in IJCAI (2018)
45. L. Liu, R. Zhang, J. Peng, G. Li, B. Du, L. Lin, Attentive crowd flow machines, in ACM MM
(ACM, 2018), pp. 1553–1561
46. Y. Sun, X. Wang, X. Tang, Deep convolutional network cascade for facial point detection, in
CVPR, pp. 3476–3483 (2013)
47. R. Weng, J. Lu, Y.-P. Tan, J. Zhou, Learning cascaded deep auto-encoder networks for face
alignment. TMM 18(10), 2066–2078 (2016)
Chapter 4
Pedestrian Detection with RPN
and Boosted Forest

Abstract Although recent deep learning object detectors have shown excellent per-
formance for general object detection, they have limited success in detecting pedestri-
ans; therefore, previous leading pedestrian detectors were generally hybrid methods
combining handcrafted and deep convolutional features. In this chapter, we propose a
very simple but effective baseline for pedestrian detection using an RPN followed by
boosted forest on shared high-resolution convolutional feature maps. We comprehen-
sively evaluate this method on several benchmarks and find that it shows competitive
accuracy and good speed.

4.1 Introduction

In this section, we investigate the issues involving Faster R-CNN as a pedestrian


detector. Faster R-CNN [1] is a particularly successful method for general object de-
tection. It consists of two components: a fully convolutional region proposal network
(RPN) for proposing candidate regions followed by a downstream Fast R-CNN [2]
classifier. The Faster R-CNN system is thus a purely CNN-based method that does
not use handcrafted features (e.g., a selective search [3] based on low-level features).
Despite its leading accuracy on several multicategory benchmarks, Faster R-CNN
has not presented competitive results on popular pedestrian detection datasets (e.g.,
the Caltech dataset [4]). Interestingly, we find that an RPN specially tailored for
pedestrian detection achieves competitive results as a standalone pedestrian detec-
tor. Surprisingly, the accuracy is degraded after the RPN proposals are fed into the
Fast R-CNN classifier. We argue that such unsatisfactory performance is attributable
to the following two reasons.
First, the convolutional feature maps of the Fast R-CNN classifier are of low reso-
lution for detecting small objects. Typical scenarios of pedestrian detection, such as
automatic driving and intelligent surveillance, generally present pedestrian instances
of small sizes (e.g., 28 × 70 for Caltech [4]). On small objects (Fig. 4.1a), the region-
of-interest (ROI) pooling layer [2, 5] performed on a low-resolution feature map
(usually with a stride of 16 pixels) can lead to “plain” features caused by collapsing
bins. These features are not discriminative in small regions and thus degrade the



Fig. 4.1 Two challenges for Fast/Faster R-CNN in pedestrian detection. a Small objects for which
ROI pooling on low-resolution feature maps may fail. b Hard negative examples that receive no
careful attention in Fast/Faster R-CNN

downstream classifier. We note that this occurrence is in contrast to handcrafted features that have finer resolutions. We address this problem by pooling features from
shallower but higher resolution layers and by the hole algorithm (namely, “à trous”
[6], or filter rarefaction [7]) that increases feature map size.
Second, in pedestrian detection, the false predictions are predominantly caused
by confusion with hard negative background instances (Fig. 4.1b). This is in contrast
to general object detection, in which a main source of confusion is multiple cate-
gories. To address hard negative examples, we adopt cascaded boosted forest (BF)
[8, 9], which performs effective hard negative mining (bootstrapping) and sample
reweighting, to classify the RPN proposals. In contrast to previous methods that use
handcrafted features to train the forest, in our method, the BF reuses the deep convo-
lutional features of the RPN. This strategy not only reduces the computational cost
of the classifier by sharing features but also exploits the deeply learned features.
We present a surprisingly simple but effective baseline for pedestrian detection
based on an RPN and BF. Our method overcomes two limitations of Faster R-CNN
for pedestrian detection and eliminates traditional handcrafted features. We present
compelling results on several benchmarks, including Caltech [4], INRIA [10], ETH
[11], and KITTI [12]. Remarkably, compared with other methods, our method has
substantially better localization accuracy and shows a relative improvement of 40%
on the Caltech dataset under an intersection-over-union (IoU) threshold of 0.7 for
evaluation. Meanwhile, our method has a test-time speed of 0.5 seconds per image,
which is competitive with the test-time speed of the previous leading methods.
In addition, this chapter reveals that key elements of traditional pedestrian detectors have been
inherited by recent methods for at least two reasons. First, the higher resolution of
handcrafted features (such as [13, 14]) and their pyramids is good for detecting
small objects. Second, effective bootstrapping is performed to mine hard negative
examples. When these key factors are appropriately handled in a deep learning sys-
tem, they lead to excellent results.

4.2 Approach

Our approach consists of two components (illustrated in Fig. 4.2): an RPN that gen-
erates candidate boxes as well as convolutional feature maps and a boosted forest
that classifies these proposals using these convolutional features.

4.2.1 Region Proposal Network for Pedestrian Detection

The RPN in Faster R-CNN [1] was developed as a class-agnostic detector (proposer)
in the scenario of multicategory object detection. For single-category detection, RPN
is naturally a detector for the only category concerned. We specifically tailor the RPN
for pedestrian detection, as introduced in the following sections.
We adopt anchors (reference boxes) with a single aspect ratio of 0.41 (width to
height). This is the average aspect ratio of pedestrians, as indicated in [4]. This ap-
proach differs from that of the original RPN, which has anchors with multiple aspect
ratios. Anchors with inappropriate aspect ratios are associated with few examples
and thus are noisy and harmful for detection accuracy. In addition, we use anchors of
9 different scales, starting from a 40-pixel height with a scaling stride of 1.3×. This
spans a wider range of scales than the original RPN. The usage of multiscale anchors
enables us to waive the requirement of using feature pyramids to detect multiscale
objects.
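As a small illustration, the anchor shapes implied by this setting can be enumerated as follows; this NumPy sketch is ours and only reproduces the stated aspect ratio and scale schedule.

```python
import numpy as np

def pedestrian_anchors(num_scales=9, base_height=40.0, scale_stride=1.3, aspect=0.41):
    """Anchor (width, height) pairs: one aspect ratio of 0.41, 9 scales from a 40-pixel height."""
    heights = base_height * scale_stride ** np.arange(num_scales)
    widths = aspect * heights
    return np.stack([widths, heights], axis=1)

print(pedestrian_anchors())   # the 9 box shapes placed at every RPN location (stride 16)
```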
We adopt the VGG-16 net [15] pretrained on the ImageNet dataset [16] as the
backbone network. The RPN is built on top of the Conv5_3 layer, which is followed by
an intermediate 3 × 3 convolutional layer and two sibling 1 × 1 convolutional layers
for classification and bounding box regression. In this way, RPN regresses boxes
with a stride of 16 pixels (Conv5_3). The classification layer provides confidence
scores for the predicted boxes, which can be used as the initial scores of the boosted
forest cascade that follows.
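A minimal PyTorch sketch of this head is shown below, assuming the 512-channel Conv5_3 feature map of VGG-16 and the 9 single-ratio anchors described above; the class name and the example input size are illustrative.

```python
import torch
import torch.nn as nn

class PedestrianRPNHead(nn.Module):
    """3x3 intermediate conv followed by sibling 1x1 classification and regression layers."""
    def __init__(self, in_ch=512, num_anchors=9):
        super().__init__()
        self.inter = nn.Conv2d(in_ch, 512, 3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors * 2, 1)   # pedestrian vs. background scores
        self.reg = nn.Conv2d(512, num_anchors * 4, 1)   # box regression (stride-16 grid)

    def forward(self, conv5_3):
        x = torch.relu(self.inter(conv5_3))
        return self.cls(x), self.reg(x)

scores, deltas = PedestrianRPNHead()(torch.rand(1, 512, 30, 40))   # e.g., a 480 x 640 input
```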


Fig. 4.2 Our pipeline. An RPN is used to compute candidate bounding boxes, scores, and convolu-
tional feature maps. The candidate boxes are fed into cascaded boosted forest (BF) for classification,
using the features pooled from the convolutional feature maps computed by the RPN

4.2.2 Feature Extraction

With the proposals generated by the RPN, we adopt ROI pooling [2] to extract fixed-
length features from regions. These features are used to train BF, as described in the
next section. Unlike Faster R-CNN, which requires that these features be fed into the
original fully connected (fc) layers and thus limits their dimensions, the BF classifier
imposes no constraint on the dimensions of the features. For example, we can extract
features from ROIs on Conv3_3 (stride = 4 pixels) and Conv4_3 (stride = 8 pixels).
We pool the features into a fixed resolution of 7 × 7. These features from different
layers are simply concatenated without normalization owing to the flexibility of the
BF classifier; in contrast, feature normalization must be carefully addressed [17] for
deep classifiers when concatenating features.
Remarkably, as there is no constraint imposed on feature dimensions, we have the
flexibility to use features with increased resolution. In particular, given the fine-tuned
layers from the RPN (stride = 4 on Conv3, 8 on Conv4, and 16 on Conv5), we can
use the à trous trick [6] to compute higher resolution convolutional feature maps. For
example, we can set the stride of Pool3 at 1 and dilate all Conv4 filters by 2, which
reduces the stride of Conv4 from 8 to 4. In contrast to previous methods [6, 7] that
fine-tune the dilated filters, in our method, we use them only for feature extraction
and do not fine-tune a new RPN.
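The sketch below illustrates both ideas with torchvision: the à trous modification of the Conv4 block and the pooling of 7 × 7 ROI features from Conv3_3 and the modified Conv4_3, concatenated into one fixed-length vector per proposal. The layer indices (16, 17, 19, 21) refer to torchvision's VGG-16 features container, and the example proposal coordinates are arbitrary; this is an illustrative reconstruction, not the original code.

```python
import torch
import torchvision
from torchvision.ops import roi_pool

vgg = torchvision.models.vgg16().features   # randomly initialised backbone for the sketch

# A trous trick: keep Pool3 at stride 1 and dilate the three Conv4 filters by 2, so that
# Conv4_3 features come out at stride 4 instead of 8 (used for feature extraction only).
vgg[16].stride = 1                          # Pool3
for idx in (17, 19, 21):                    # Conv4_1, Conv4_2, Conv4_3
    vgg[idx].dilation, vgg[idx].padding = (2, 2), (2, 2)

img = torch.rand(1, 3, 480, 640)
conv3_3 = vgg[:16](img)                     # 256 channels, stride 4
conv4_3 = vgg[16:23](conv3_3)               # 512 channels, stride 4 after the modification

# ROI-pool 7x7 features from both layers and concatenate them; the boosted forest places
# no constraint on the resulting feature dimension.
rois = torch.tensor([[0, 100.0, 80.0, 150.0, 200.0]])   # (batch index, x1, y1, x2, y2)
feat = torch.cat([
    roi_pool(conv3_3, rois, output_size=(7, 7), spatial_scale=1 / 4),
    roi_pool(conv4_3, rois, output_size=(7, 7), spatial_scale=1 / 4),
], dim=1).flatten(1)                         # one fixed-length vector per proposal
```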
Although we adopt the same ROI resolution (7 × 7) as that of Faster R-CNN
[1], these ROIs are on higher resolution feature maps (e.g., Conv3_3, Conv4_3, or
Conv4_3 à trous) than Fast R-CNN (Conv5_3). If an ROI input resolution is smaller
than the output (i.e., <7 × 7), the pooling bins collapse, and the features become
“flat” and not discriminative. This problem is alleviated in our method, as it is not
constrained to use Conv5_3 features in the downstream classifier.

4.2.3 Boosted Forest

The RPN generates region proposals, confidence scores, and features, all of which are
used to train a cascaded boosted forest classifier. We adopt the RealBoost algorithm
[8] and mainly follow the hyperparameters in [18]. Formally, we bootstrap the train-
ing 6 times, and the forest in each stage has {64, 128, 256, 512, 1024, 1536} trees.
Initially, the training set consists of all positive examples (∼50k on the Caltech set)
and the same number of randomly sampled negative examples from the proposals.
After each stage, additional hard negative examples (whose number is 10% of the
positives, ∼5k on Caltech) are mined and added to the training set. Finally, a forest
of 2048 trees is trained after all bootstrapping stages. This final forest classifier is
used for inference.
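The bootstrapping schedule can be summarized by the following sketch. Because RealBoost is not available as an off-the-shelf routine, fit_realboost and mine_hard_negatives are hypothetical placeholders standing in for training a boosted forest on the pooled RPN features and for scoring the negative pool to collect high-scoring false positives; the stage sizes and the 10% mining ratio follow the text.

```python
import numpy as np

def train_boosted_forest(pos_feats, neg_pool_feats, n_pos=50000):
    """Bootstrapped training schedule described above (the two helpers are placeholders)."""
    stages = [64, 128, 256, 512, 1024, 1536, 2048]   # trees per stage; 2048 is the final forest
    rng = np.random.default_rng(0)
    neg_feats = neg_pool_feats[rng.choice(len(neg_pool_feats), size=n_pos, replace=False)]
    forest = None
    for n_trees in stages:
        forest = fit_realboost(pos_feats, neg_feats, n_trees)      # hypothetical trainer
        if n_trees < stages[-1]:                                   # mine after the first 6 stages
            hard = mine_hard_negatives(forest, neg_pool_feats, k=n_pos // 10)
            neg_feats = np.concatenate([neg_feats, hard])
    return forest
```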
We note that it is not necessary to treat the initial proposals equally because the
initial confidence scores of the proposals are computed by the RPN. In other words,

the RPN can be considered as the stage-0 classifier f_0, and we set f_0 = (1/2) log(s/(1 − s)) following the RealBoost form, where s is the score of a proposed region (f_0 is a constant in standard boosting). The other stages are as in the standard RealBoost.
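In code, this amounts to converting each RPN confidence into an initial boosting score before the first trained stage; a minimal sketch:

```python
import numpy as np

def stage0_score(s, eps=1e-6):
    """Map an RPN confidence s to the stage-0 score f_0 = 0.5 * log(s / (1 - s))."""
    s = np.clip(s, eps, 1.0 - eps)   # avoid log(0) for extreme confidences
    return 0.5 * np.log(s / (1.0 - s))
```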

4.3 Experiments and Analysis

Caltech Figures 4.3 and 4.5 show the results on the Caltech dataset. When original
annotations are used (Fig. 4.3), our method has an MR of 9.6%, which is more than 2
points better than that of the closest competitor (11.7% of CompACT-Deep [18]).
When the corrected annotations are used (Fig. 4.5), our method has an MR−2 of 7.3%
and an MR−4 of 16.8%, both of which are 2 points better than those of the previous
best methods.

Fig. 4.3 Comparisons on the Caltech set (legends indicate MR)

Fig. 4.4 Comparisons on the Caltech set using an IoU threshold of 0.7 to determine true positives (legends indicate MR)

Fig. 4.5 Comparisons on the Caltech-New set (legends indicate MR−2 (MR−4))

Fig. 4.6 Comparisons on the INRIA dataset (legends indicate MR)

In addition, except for CCF (MR 18.7%) [19], our method (MR 9.6%) is the
only method that uses no handcrafted features. Our results suggest that handcrafted
features are not essential for good accuracy on the Caltech dataset; rather, high-
resolution features and bootstrapping, both of which are missing in the original Fast
R-CNN detector, are the keys to good accuracy.
Figure 4.4 shows the results on Caltech, where an IoU threshold of 0.7 is
used to determine true positives (instead of 0.5 by default). With this more chal-
lenging metric, most methods exhibit a dramatic performance decrease; e.g., the
MR of CompACT-Deep [18]/DeepParts [20] increases from 11.7%/11.9% to
38.1%/40.7%. Our method has an MR of 23.5%, which is a relative improvement
of ∼40% over that of the closest competitors. This comparison demonstrates that
our method has substantially better localization accuracy than other methods. It also
indicates that there is much room to improve localization performance on this widely
evaluated dataset.
INRIA and ETH Figures 4.6 and 4.7 show the results on the INRIA and ETH
datasets. On the INRIA dataset, our method achieves an MR of 6.9%, which is

Fig. 4.7 Comparisons on the ETH dataset (legends indicate MR)

considerably better than that of the best available competitor (11.2%). On the ETH
set, our result (30.2%) is better than that of the previous leading method (TA-CNN
[21]) by 5 points.

References

1. R. Girshick, S. Ren, K. He, J. Sun, Faster r-cnn: Towards real-time object detection with region
proposal networks, in Neural Information Processing Systems (NIPS) (2015)
2. R. Girshick, Fast r-cnn, in IEEE International Conference on Computer Vision (ICCV) (2015)
3. J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, Selective search for object
recognition. IJCV 104(2), 154–171 (2013)
4. B. Schiele, P. Dollár, C. Wojek, P. Perona, Pedestrian detection: an evaluation of the state of
the art (2012)
5. K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual
recognition, in European Conference on Computer Vision (ECCV) (2014)
6. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv:1412.7062 (2014)
7. R. Tibshirani, J. Friedman, T. Hastie, Additive logistic regression: a statistical view of boosting,
in The annals of statistics (2000)
8. P. Dollár, R. Appel, T. Fuchs, P. Perona, Quickly boosting decision trees - pruning underachieving
features early, in International Conference on Machine Learning (ICML) (2013)
9. N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) (2005)
10. B. Leibe, A. Ess, L. Van Gool, Depth and appearance for mobile scene analysis, in IEEE
International Conference on Computer Vision (ICCV) (2007)
11. P. Lenz, A. Geiger, R. Urtasun, Are we ready for autonomous driving? the kitti vision benchmark
suite, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
12. P. Perona, P. Dollár, Z. Tu, S. Belongie, Integral channel features, in British Machine Vision
Conference (BMVC) (2009)
13. S. Belongie, P. Dollár, R. Appel, P. Perona, Fast feature pyramids for object detection (2014)

14. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
15. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge (2015)
16. W. Liu, A. Rabinovich, A.C. Berg, ParseNet: looking wider to see better. arXiv:1506.04579 (2015)
17. M. Saberian, Z. Cai, N. Vasconcelos, Learning complexity-aware cascades for deep pedestrian
detection, in IEEE International Conference on Computer Vision (ICCV) (2015)
18. B. Yang, J. Yan, Z. Lei, S. Z. Li, Convolutional channel features, in ICCV, pp. 82–90 (2015)
19. X. Wang, Y. Tian, P. Luo, X. Tang, Deep learning strong parts for pedestrian detection, in IEEE
International Conference on Computer Vision (ICCV) (2015)
20. X. Wang, Y. Tian, P. Luo, X. Tang, Pedestrian detection aided by deep learning semantic tasks,
in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
21. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in
CVPR, pp. 3431–3440 (2015)
Part III
Parsing Person in Detail

A comprehensive human visual understanding of scenarios in the wild, which is


regarded as one of the most fundamental problems in computer vision, could have a
crucial impact on many higher level application domains such as person reidentifica-
tion [1], video surveillance [2], and human behavior analysis [3, 4]. Human parsing
aims to segment a human image into multiple parts with fine-grained semantics (e.g.,
body parts and clothing) and provides a more detailed understanding of image con-
tent, that is, parsing person in detail, which is one of the most critical and correlated
tasks in analyzing images of humans by providing pixel-wise understanding.
Recent progress in human parsing [5–12] has been achieved by improving fea-
ture representations using CNNs, recurrent neural networks, and complex graphical
models (e.g., conditional random fields (CRFs)). For example, Liang et al. [12] pro-
posed a novel Co-CNN architecture that integrates multiple levels of image contexts
into a unified network. In addition to human parsing, there has also been increas-
ing research interest in the part segmentation of other objects, such as animals or
cars [13–15]. To capture the rich structural information based on the advanced CNN
architecture, common solutions include the combination of CNNs and CRFs [16,
17] and adopting multiscale feature representations [5, 6, 16]. Chen et al. [5] pro-
posed an attention mechanism that learns to weight multiscale features at each pixel
location. However, without imposing human body structure priors, these general
approaches based on bottom-up appearance information tend to produce unreason-
able results (e.g., right arm connected with left shoulder). Human body structural
information has been previously thoroughly explored in human pose estimation [18,
19], in which dense joint annotations are provided. However, because human parsing
requires more extensive and detailed prediction than pose estimation, it is difficult to
directly utilize joint-based pose estimation models in pixel-wise prediction to incor-
porate the complex structural constraints. To explicitly enforce semantic consistency
between the produced parsing results and the human pose/joint structures, we pro-
pose a novel structure-sensitive learning approach to human parsing. In addition to
using the traditional pixel-wise part annotations as the supervision, we introduce a
structure-sensitive loss to evaluate the quality of the predicted parsing results from
a joint structure perspective. This means that a satisfactory parsing result should be
able to preserve a reasonable joint structure (e.g., the spatial layout of human parts).

Furthermore, previous approaches focus only on the single-person parsing task


in simplified and limited conditions such as fashion pictures [8, 9, 12, 20, 21] with
upright poses and diverse daily images [22], and disregard more complex real-world
cases in which multiple person instances appear in one image. Such ill-posed single-
person parsing tasks severely limit the potential application of human analysis to
more challenging scenarios (e.g., group behavior prediction). We make the first effort
to resolve the more challenging instance-level human parsing task, which needs to
not only segment various body parts or clothing but also associate each part with
one instance. In addition to the difficulties shared with single-person parsing (e.g.,
various appearances/viewpoints, self-occlusions), instance-level human parsing is a
more challenging task because the number of person instances in an image varies
immensely, and this variation cannot be conventionally addressed using traditional
single-person parsing pipelines with a fixed prediction space that categorize a fixed
number of part labels.
Beyond the existing single-person and multiple-person human parsing tasks in
static images, we move forward to investigate a more realistic video instance-level
human parsing that simultaneously segments each person instance and parses each
instance into more fine-grained parts (e.g., head, leg, dress) for each frame in a video.
This task is more challenging and is aligned with the requirements of human-centric
analytic applications. In addition to the difficulties shared with single-person parsing
(e.g., various appearances/viewpoints, self-occlusions) and instance-level parsing
(e.g., an uncertain number of instances), video human parsing faces more challenges
that are inevitable in video object detection and segmentation. Recognition accuracy
may be severely affected by deteriorated appearance issues in videos that are seldom
observed in still images such as motion blur and defocus. On the other hand, the
balance between frame-level accuracy and time efficiency is also a crucial factor in
the deployment of diverse devices (such as mobile devices).

References

1. R. Zhao, W. Ouyang, X. Wang, Unsupervised salience learning for person re-identification, in CVPR (2013)
2. L. Wang, X. Ji, Q. Deng, M. Jia, Deformable part model based multiple pedestrian
detection for video surveillance in crowded scenes, in VISAPP (2014)
3. C. Gan, M. Lin, Y. Yang, G. de Melo, A.G. Hauptmann, Concepts not alone:
Exploring pairwise relationships for zero-shot video activity recognition, in AAAI
(2016)
4. X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, S. Yan, Proposal-free network for
instance-level object segmentation. arXiv preprint arXiv:1509.02636 (2015)
5. L.C. Chen, Y. Yang, J. Wang, W. Xu, A.L. Yuille, Attention to scale: Scale-aware
semantic image segmentation, in CVPR (2016)
6. F. Xia, P. Wang, L.C. Chen, A.L. Yuille, Zoom better to see clearer: Human part
segmentation with auto zoom net, in ECCV (2016)

7. K. Yamaguchi, M. Kiapour, L. Ortiz, T. Berg, Parsing clothing in fashion photographs, in CVPR (2012)
8. K. Yamaguchi, M. Kiapour, T. Berg, Paper doll parsing: Retrieving similar styles
to parse clothing items, in ICCV (2013)
9. J. Dong, Q. Chen, W. Xia, Z. Huang, S. Yan, A deformable mixture parsing model
with parselets, in ICCV (2013)
10. E. Simo-Serra, S. Fidler, F. Moreno-Noguer, R. Urtasun, A High Performance
CRF Model for Clothes Parsing, in ACCV (2014)
11. S. Liu, X. Liang, L. Liu, X. Shen, J. Yang, C. Xu, L. Lin, X. Cao, S. Yan, Matching-CNN meets KNN: quasi-parametric human parsing, in CVPR (2015)
12. X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, S. Yan, Human parsing
with contextualized convolutional neural network, in ICCV (2015)
13. J. Wang, A. Yuille, Semantic part segmentation using compositional model com-
bining shape and appearance, in CVPR (2015)
14. P. Wang, X. Shen, Z. Lin, S. Cohen, B. Price, A. Yuille, Joint object and part
segmentation using deep learned potentials, in ICCV (2015)
15. W. Lu, X. Lian, A. Yuille, Parsing semantic parts of cars using graphical models
and segment appearance consistency, in BMVC (2014)
16. L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915 (2016)
17. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang,
P. Torr, Conditional random fields as recurrent neural networks, in ICCV (2015)
18. W. Yang, W. Ouyang, H. Li, X. Wang, End-to-end learning of deformable mixture
of parts and deep convolutional neural networks for human pose estimation, in
CVPR (2016)
19. X. Chen, A. Yuille, Articulated pose estimation by a graphical model with image
dependent pairwise relations, in NIPS (2014)
20. X. Liang, S. Liu, X. Shen, J. Yang, L. Liu, J. Dong, L. Lin, S. Yan, Deep human
parsing with active template regression, in TPAMI (2015)
21. X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, et al., Detect what you can:
detecting and representing objects using holistic models and body parts, in CVPR
(2014)
22. K. Gong, X. Liang, D. Zhang, X. Shen, L. Lin, Look into person: Self-supervised
structure-sensitive learning and a new benchmark for human parsing, in CVPR
(2017)
Chapter 5
Self-supervised Structure-Sensitive
Learning for Human Parsing

Abstract Human parsing has recently attracted much research interest due to its
enormous application potential. In this chapter, we introduce a new benchmark,
“Look into Person (LIP),” that makes a significant advance in terms of scalability,
diversity, and difficulty, a contribution that we feel is crucial for future developments
in human-centric analysis. Furthermore, in contrast to the existing efforts to improve
feature discriminative capability, we solve human parsing by exploring a novel self-
supervised structure-sensitive learning approach that imposes human pose structures
on the parsing results without requiring extra supervision. Our self-supervised learn-
ing framework can be injected into any advanced neural network to help incorporate
rich high-level knowledge regarding human joints from a global perspective and
improve the parsing results (© 2019 IEEE. Reprinted, with permission, from [1]).

5.1 Introduction

Human parsing aims to segment a human image into multiple parts with fine-grained
semantics and provide a more detailed understanding of image content. It can facilitate many higher level computer vision applications [2], such as person reidentification [3] and human behavior analysis [4, 5].
Recently, convolutional neural networks (CNNs) have achieved exciting success
in human parsing [6–8]. Nevertheless, as demonstrated in many other problems such
as object detection [9] and semantic segmentation [10], the performance of such
CNN-based approaches relies heavily on the availability of annotated images for
training. To train a human parsing network with potentially practical value in real-
world applications, it is highly desirable to have a large-scale dataset composed of
representative instances with varied clothing appearances, strong articulation, partial
(self-)occlusions, truncation at image borders, diverse viewpoints, and background
clutter. Although training sets exist for special scenarios such as fashion pictures [6,
8, 11, 12] and people in constrained situations (e.g., upright) [13], these datasets
are limited in their coverage and scalability, as shown in Fig. 5.1. The largest public
human parsing dataset [8] thus far contains only 17,000 fashion images, while others
include only thousands of images (Table 5.1).


Fig. 5.1 Annotation examples for our “Look into Person (LIP)” dataset and existing datasets. a
The images in the ATR dataset are of fixed size and contain only instances of persons standing up
in the outdoors. b The images in the PASCAL-Person-Part dataset also have lower scalability and
contain only 6 coarse labels. c The images in our LIP dataset have high appearance variability and
complexity

Table 5.1 Overview of the publicly available datasets for human parsing. For each dataset, we report the number of annotated persons in the training, validation, and test sets as well as the number of categories, including background

Dataset                  | #Training  #Validation  #Test   Categories
Fashionista [14]         | 456        −            229     56
PASCAL-Person-Part [13]  | 1,716      −            1,817   7
ATR [8]                  | 16,000     700          1,000   18
LIP                      | 30,462     10,000       10,000  20

However, to the best of our knowledge, no attempt has been made to establish a
standard representative benchmark aiming to cover a wide range of challenges for the
human parsing task. The existing datasets do not provide an evaluation server with a
secret test set to avoid potential dataset overfitting, which hinders further development
in this area. Therefore, we propose a new benchmark, “Look into Person (LIP)”, and a
public server to automatically report evaluation results. Our benchmark significantly
advances the state of the art in terms of appearance variability and complexity, as
it includes 50,462 human images with pixel-wise annotations of 19 semantic parts
(Fig. 5.2).


Fig. 5.2 An example shows that self-supervised structure-sensitive learning is helpful for human
parsing. a The original image. b Parsing results by attention-to-scale [15], with the left arm wrongly
labeled as the right arm. c Our parsing results successfully incorporate the structure information to
generate reasonable outputs

5.2 Look into Person Benchmark

With 50,462 annotated images, LIP is an order of magnitude larger and more chal-
lenging than previous similar attempts [8, 13, 14]. It is annotated with elaborated
pixel-wise annotations with 19 semantic human part labels and one background label.
The images collected from real-world scenarios contain people appearing with chal-
lenging poses and viewpoints, heavy occlusions, various appearances, and a wide
range of resolutions. Furthermore, the backgrounds of the images in the LIP dataset
are more complex and diverse than those in previous datasets. Some examples are
shown in Fig. 5.1.
• Image Annotation The images in the LIP dataset are cropped person instances
from the Microsoft COCO [16] training and validation sets. We defined 19 human
parts or clothing labels for annotation, hat, hair, sunglasses, upper clothing, dress,
coat, socks, pants, gloves, scarf, skirt, jumpsuit, face, right arm, left arm, right leg,
left leg, right shoe, and left shoe, as well as a background label. We implement
an annotation tool and generate multiscale superpixels of images based on [17] to
speed up the annotation.
• Dataset splits In total, there are 50,462 images in the LIP dataset, including 19,081
full-body images, 13,672 upper body images, 403 lower body images, 3,386 head-
missing images, 2,778 back-view images, and 21,028 images with occlusions. We
divide the images into separate training, validation, and test sets. Following random
selection, we arrive at a unique division consisting of 30,462 training and 10,000
validation images with publicly available annotations as well as 10,000 test images
with the annotations withheld for benchmarking purposes.
• Dataset statistics In this section, we analyze the images and categories in the LIP
dataset in detail. In general, the face, arms, and legs are the most identifiable
parts of a human body. However, human parsing aims to analyze every detailed
region of a person, including different body parts as well as different categories
of clothing. We, therefore, define 6 body parts and 13 clothing categories. Among

Fig. 5.3 The data distribution of the 19 semantic part labels in the LIP dataset

the 6 body parts, we divide arms and legs into left and right sides for more precise
analysis, which also increases the difficulty of the task. For clothing classes, we
include not only common clothing, such as upper clothing, pants, and shoes, but
also infrequent categories, such as skirts and jumpsuits. Furthermore, small-scale
accessories, such as sunglasses, gloves, and socks, are also taken into account. The
numbers of images for each semantic part label are presented in Fig. 5.3.
The images in the LIP dataset contain diverse human appearances, viewpoints,
and occlusions. Additionally, more than half of the images suffer occlusions of
different degrees. Occlusion is considered to occur if any of the 19 semantic parts
appear in the image but are occluded or invisible. In more challenging cases, the
images contain back-view instances, which give rise to greater ambiguity in the
left and right spatial layouts. The numbers of images of different appearances
(i.e., occlusion, full-body, upper body, head-missing, back-view, and lower body
images) are summarized in Fig. 5.4.

5.3 Self-supervised Structure-Sensitive Learning

As previously mentioned, a major limitation of the existing human parsing approaches


is the lack of consideration of human body configuration, which is mainly investi-
gated in the human pose estimation problem. Human parsing and pose estimation aim
to label each image with different granularities, that is, pixel-wise semantic labeling
versus joint-wise structure prediction. Pixel-wise labeling can address more detailed
information, while joint-wise structure provides a more high-level approach. How-
ever, the results of state-of-the-art pose estimation models [21, 22] still have many

Fig. 5.4 The numbers of images that show diverse types of visibility in the LIP dataset, including
occlusion, full-body, upper body, lower body, head-missing, and back-view images

Fig. 5.5 Illustration of self-supervised structure-sensitive learning for human parsing. An input
image goes through parsing networks, including several convolutional layers, to generate the parsing
results. The generated joints and joints ground truth, represented as heatmaps, are obtained by
computing the center points of the corresponding regions in parsing maps, including head (H),
upper body (U), lower body (L), right arm (RA), left arm (LA), right leg (RL), left leg (LL), right
shoe (RS), and left shoe (LS). The structure-sensitive loss is generated by weighting segmentation
loss with joint structure loss. For clear observation, we combine nine heatmaps into one map

errors. The predicted joints do not have high enough quality to guide human parsing
compared with the joints extracted from parsing annotations. Moreover, the joints
in pose estimation are not aligned with parsing annotations. For example, the arms
are labeled as arms for parsing annotations only if they are not covered by clothing,
while the pose annotations are independent of clothing. To address these issues in this
work, we investigate how to leverage informative high-level structure cues to guide
pixel-wise prediction. We propose a novel self-supervised structure-sensitive learn-
ing for human parsing, which introduces a self-supervised structure-sensitive loss to

evaluate the quality of predicted parsing results from a joint structure perspective, as
illustrated in Fig. 5.5.
Specifically, in addition to using the traditional pixel-wise annotations for super-
vision, we generate the approximated human joints directly from the parsing anno-
tations, which can also guide human parsing training. To explicitly enforce semantic
consistency between the produced parsing results and human joint structures, we
treat the joint structure loss as a weight of segmentation loss, which becomes our
structure-sensitive loss.

5.3.1 Self-supervised Structure-Sensitive Loss

Generally, for the human parsing task, no extensive information is provided other
than the pixel-wise annotations. Instead of using augmentative information, we must
obtain a structure-sensitive supervision from the parsing annotations. As the human
parsing results are semantic parts with pixel-level labels, we try to explore the pose
information contained in human parsing results. We define 9 joints to construct a pose
structure, which are the centers of the regions of the head, upper body, lower body,
left arm, right arm, left leg, right leg, left shoe, and right shoe. The region of the head
is generated by merging the parsing labels of hat, hair, sunglasses, and face. Similarly,
upper clothing, coat, and scarf are merged into upper body and pants and skirt into
lower body. The remaining regions can also be obtained by the corresponding labels.
Some examples of human joints generated for different humans are shown in Fig. 5.6.
Following [23], for each parsing result and corresponding ground truth, we compute
the center points of the regions and represent the joints as smooth heatmaps for training.
Then, we use the Euclidean metric to evaluate the quality of the generated joint
structures, which also reflect the structural consistency between the predicted parsing
results and the ground truth. Finally, the pixel-wise segmentation loss is weighted by
the joint structure loss, which becomes our structure-sensitive loss. Consequently,
the overall human parsing networks become self-supervised with structure-sensitive
loss.
Formally, given an image I, we define a list of joint configurations C_I^P = {c_i^p | i ∈ [1, N]}, where c_i^p is the heatmap of the i-th joint computed according to the parsing result map. Similarly, C_I^{GT} = {c_i^{gt} | i ∈ [1, N]} is obtained from the corresponding parsing ground truth. Here, N is a variable determined by the human body visible in the input image and is equal to 9 for a full-body image. For the joints missing from the image, we simply replace the heatmaps with maps filled with zeros. The joint structure loss is the Euclidean (L2) loss, calculated as

$$
L_{Joint} = \frac{1}{2N} \sum_{i=1}^{N} \left\| c_i^{p} - c_i^{gt} \right\|_2^2, \tag{5.1}
$$

Fig. 5.6 Some examples of self-supervised human joints generated from our parsing results for
different bodies


Fig. 5.7 Visualized comparison of human parsing results on the LIP validation set. a Upper body
images. b The back-view images. c The head-missing images. d The images with occlusion. e The
full-body images

The final structure-sensitive loss, denoted as L_Structure, is the combination of the joint structure loss and the parsing segmentation loss, calculated as

$$
L_{Structure} = L_{Joint} \cdot L_{Parsing}, \tag{5.2}
$$

where L_Parsing is the pixel-wise softmax loss calculated based on the parsing annotations.
We refer to our learning framework as “self-supervised” as the abovementioned
structure-sensitive loss can be generated from the existing parsing results with no
additional information. Our self-supervised learning framework thus has excellent
adaptability and extensibility and can be injected into any advanced network to
incorporate rich high-level knowledge about human joints from a global perspective
(Fig. 5.7).
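A simplified PyTorch sketch of this loss is given below. The part-to-joint grouping follows the merging rule described above, but the numeric label ids assume the ordering listed in Sect. 5.2 rather than any official annotation files; the Gaussian rendering of heatmaps and the per-pixel averaging inside the joint term are simplifications; and joint centers are computed from hard (argmax) predictions, so the joint term acts purely as a weight on the parsing loss, in the spirit of Eqs. (5.1) and (5.2).

```python
import torch
import torch.nn.functional as F

# Pseudo-joints are centres of merged part regions; label ids follow the order in Sect. 5.2.
JOINT_GROUPS = {
    "head": [1, 2, 3, 13],        # hat, hair, sunglasses, face
    "upper_body": [4, 6, 10],     # upper clothing, coat, scarf
    "lower_body": [8, 11],        # pants, skirt
    "right_arm": [14], "left_arm": [15],
    "right_leg": [16], "left_leg": [17],
    "right_shoe": [18], "left_shoe": [19],
}

def joint_heatmaps(label_map, sigma=6.0):
    """Render one Gaussian heatmap per pseudo-joint from an (H, W) part-label map."""
    h, w = label_map.shape
    ys, xs = torch.meshgrid(torch.arange(h).float(), torch.arange(w).float(), indexing="ij")
    maps = []
    for labels in JOINT_GROUPS.values():
        mask = torch.zeros_like(label_map, dtype=torch.bool)
        for lbl in labels:
            mask |= label_map == lbl
        if mask.any():
            cy, cx = mask.nonzero(as_tuple=False).float().mean(dim=0)
            maps.append(torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2)))
        else:
            maps.append(torch.zeros(h, w))    # missing joint -> all-zero heatmap
    return torch.stack(maps)                  # (9, H, W)

def structure_sensitive_loss(logits, gt_labels):
    """L_Structure = L_Joint * L_Parsing, following Eqs. (5.1) and (5.2)."""
    parsing_loss = F.cross_entropy(logits, gt_labels)      # pixel-wise softmax loss
    pred_labels = logits.argmax(dim=1)
    joint_loss = 0.0
    for pred, gt in zip(pred_labels, gt_labels):            # per-image Euclidean joint loss
        diff = joint_heatmaps(pred) - joint_heatmaps(gt)
        joint_loss = joint_loss + 0.5 * diff.pow(2).mean()  # Eq. (5.1), averaged per pixel
    return (joint_loss / len(gt_labels)) * parsing_loss     # Eq. (5.2)

loss = structure_sensitive_loss(torch.randn(2, 20, 64, 64), torch.randint(0, 20, (2, 64, 64)))
```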

5.3.2 Experimental Result

Dataset: We evaluate the performance of our self-supervised structure-sensitive


learning method on the human parsing task on two challenging datasets. The first is
the public PASCAL-Person-Part dataset, with 1,716 images for training and 1,817
for testing, which follows the human part segmentation annotated by [13]. Follow-

Table 5.2 Performance comparison in terms of per-class IoU with four state-of-the-art methods on the LIP validation set

Method            | Hat    Hair   Gloves  Sunglasses  u-clothes  Dress  Coat   Socks  Pants  Jumpsuit
SegNet [18]       | 26.60  44.01  0.01    0.00        34.46      0.00   15.97  3.59   33.56  0.01
FCN-8s [19]       | 39.79  58.96  5.32    3.08        49.08      12.36  26.82  15.66  49.41  6.48
DeepLabV2 [20]    | 57.94  66.11  28.50   18.40       60.94      23.17  47.03  34.51  64.00  22.38
Attention [15]    | 58.87  66.78  23.32   19.48       63.20      29.63  49.70  35.23  66.04  24.73
DeepLabV2 + SSL   | 58.41  66.22  28.76   20.05       62.26      21.18  48.17  36.12  65.16  22.94
Attention + SSL   | 59.75  67.25  28.95   21.57       65.30      29.49  51.92  38.52  68.02  24.48

Table 5.3 Comparison of person part segmentation performance with four state-of-the-art methods on the PASCAL-Person-Part dataset [13]

Method                 | head   torso  u-arms  l-arms  u-legs  l-legs  Bkg    Avg
DeepLab-LargeFOV [20]  | 78.09  54.02  37.29   36.85   33.73   29.61   92.85  51.78
HAZN [24]              | 80.79  59.11  43.05   42.76   38.99   34.46   93.59  56.11
Attention [15]         | 81.47  59.06  44.15   42.50   38.28   35.62   93.65  56.39
LG-LSTM [7]            | 82.72  60.99  45.40   47.76   42.33   37.96   88.63  57.97
Attention + SSL        | 83.26  62.40  47.80   45.58   42.32   39.48   94.68  59.36

ing [15, 24], the annotations are merged into six person part classes, head, torso,
upper/lower arms and upper/lower legs, and one background class. The second is
our large-scale LIP dataset, which is highly challenging, with high pose complex-
ity, heavy occlusions, and body truncation, as introduced and analyzed in Sect. 5.2
(Tables 5.2 and 5.3).

References

1. K. Gong, X. Liang, D. Zhang, X. Shen, L. Lin, Look into Person: Self-Supervised Structure-
Sensitive Learning and a New Benchmark for Human Parsing, in 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Honolulu, pp. 6757–6765 (2017)
2. H. Zhang, G. Kim, E.P. Xing, Dynamic topic modeling for monitoring market competition from
online text and image data, in ACM SIGKDD (ACM, 2015)
3. R. Zhao, W. Ouyang, X. Wang, Unsupervised salience learning for person re-identification, in
CVPR (2013)
4. C. Gan, M. Lin, Y. Yang, G. de Melo, A.G. Hauptmann, Concepts not alone: Exploring pairwise
relationships for zero-shot video activity recognition, in AAAI (2016)
5. X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, S. Yan, Proposal-free network for instance-level object
segmentation. arXiv preprint arXiv:1509.02636 (2015)
6. X. Liang, S. Liu, X. Shen, J. Yang, L. Liu, J. Dong, L. Lin, S. Yan, Deep human parsing with
active template regression, in TPAMI (2015)
7. X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, S. Yan, Semantic object parsing with local-global
long short-term memory, in CVPR (2016)
8. X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, S. Yan, Human parsing with
contextualized convolutional neural network, in ICCV (2015)
9. X. Liang, S. Liu, Y. Wei, L. Liu, L. Lin, S. Yan, Towards computational baby learning: a
weakly-supervised approach for object detection, in Proceedings of the IEEE International
Conference on Computer Vision, pp. 999–1007 (2015)
10. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. Torr,
Conditional random fields as recurrent neural networks, in ICCV (2015)
11. K. Yamaguchi, M. Kiapour, T. Berg, Paper doll parsing: Retrieving similar styles to parse
clothing items, in ICCV (2013)
12. J. Dong, Q. Chen, W. Xia, Z. Huang, S. Yan, A deformable mixture parsing model with parselets,
in ICCV (2013)
13. X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, et al., Detect what you can: detecting and
representing objects using holistic models and body parts, in CVPR (2014)
14. K. Yamaguchi, M. Kiapour, L. Ortiz, T. Berg, Parsing clothing in fashion photographs, in CVPR
(2012)
15. L.C. Chen, Y. Yang, J. Wang, W. Xu, A.L. Yuille, Attention to scale: Scale-aware semantic
image segmentation, in CVPR (2016)
16. T. Lin, M. Maire, S.J. Belongie, L.D. Bourdev, R.B. Girshick, J. Hays, P. Perona, D. Ramanan,
P. Dollár, C.L. Zitnick, Microsoft COCO: common objects in context. CoRR abs/1405.0312
(2014)
17. P. Arbelaez, M. Maire, C. Fowlkes, J. Malik, Contour detection and hierarchical image segmentation. TPAMI 33(5), 898–916 (2011)
18. V. Badrinarayanan, A. Kendall, R. Cipolla, SegNet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 (2015)
19. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation,
arXiv preprint arXiv:1411.4038 (2014)

20. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv:1412.7062 (2014)
21. W. Yang, W. Ouyang, H. Li, X. Wang, End-to-end learning of deformable mixture of parts and
deep convolutional neural networks for human pose estimation, in CVPR (2016)
22. X. Chen, A. Yuille, Articulated pose estimation by a graphical model with image dependent
pairwise relations, in NIPS (2014)
23. T. Pfister, J. Charles, A. Zisserman, Flowing convnets for human pose estimation in videos, in
ICCV (2015)
24. F. Xia, P. Wang, L.C. Chen, A.L. Yuille, Zoom better to see clearer: human part segmentation with auto zoom net, in ECCV (2016)
Chapter 6
Instance-Level Human Parsing

Abstract Instance-level human parsing in real-world human analysis scenarios is still underexplored due to the absence of sufficient data resources and the technical difficulty in parsing multiple instances in a single pass. In this chapter, we make the
first attempt to explore a detection-free part grouping network (PGN) to efficiently
parse multiple people in an image in a single pass. PGN reformulates instance-level
human parsing as twinned subtasks that can be jointly learned and mutually refined
via a unified network: (1) semantic part segmentation for assigning each pixel as a
human part and (2) instance-aware edge detection to group semantic parts into dis-
tinct person instances. Thus, the shared intermediate representation is endowed with
capabilities in both characterizing fine-grained parts and inferring instance belong-
ings of each part. Finally, a simple instance partition process is employed to obtain
the final results during inference.

6.1 Introduction

Human parsing for recognizing each semantic part (e.g., arms, legs) is one of the
most fundamental and critical tasks in analyzing humans in the wild and plays an
important role in higher level application domains such as video surveillance [1] and
human behavior analysis [2, 3].
Driven by the advance of fully convolutional networks (FCNs) [4], human parsing,
or semantic part segmentation, has recently made great progress owing to deeply
learned features [5, 6], large-scale annotations [7, 8], and advanced reasoning over
graphical models [9, 10]. However, previous approaches focus only on the single-
person parsing task in simplified and limited conditions such as fashion pictures
[11–13] with upright poses and diverse daily images [7], and disregard real-world
cases in which multiple person instances appear in one image. Such ill-posed single-
person parsing tasks severely limit the potential application of human parsing to
more challenging scenarios (e.g., group behavior prediction).
In this work, we aim to resolve the more challenging instance-level human parsing task, which needs to not only segment various body parts or clothes but also associate each part with one instance, as shown in Fig. 6.1.


Fig. 6.1 Examples of large-scale “Crowd Instance-level Human Parsing (CIHP)” dataset, which
contains 38,280 multiperson images with elaborate annotations and high appearance variability
as well as complexity. The images are presented in the first row. The annotations of semantic part
segmentation and instance-level human parsing are shown in the second and third rows, respectively.
Best viewed in color

In addition to the difficulties shared with single-person parsing (e.g., various appearances/viewpoints and self-occlusions), instance-level human parsing is a more challenging task because the number of person instances in an image varies immensely, and this variability cannot be addressed by traditional single-person parsing pipelines, whose fixed prediction space categorizes a fixed number of part labels.
A recent work [14] explored this task by following the “parsing-by-detection”
pipeline [15–19] that first localizes bounding boxes of instances and then performs
fine-grained semantic parsing for each box. However, such complex pipelines are
trained using several independent targets and stages for detection and segmentation,
which may lead to inconsistent results for coarse localization and pixel-wise part
segmentation. For example, segmentation models may predict semantic part regions
outside the boxes detected by detection models because their intermediate represen-
tations are dragged in different directions.
In this work, we reformulate instance-level human parsing from a new perspective,
that is, addressing two coherent segment grouping goals, part-level pixel grouping,
and instance-level part grouping, via a unified network. First, part-level pixel group-
ing can be addressed by the semantic part segmentation task, which assigns each pixel
to one part label, thus learning the categorization property. Second, given a set of
independent semantic parts, instance-level part grouping can determine the instance
belongings of all parts according to the predicted instance-aware edges, where parts
that are separated by instance edges will be grouped into distinct person instances.
We call this detection-free unified network that jointly optimizes semantic part seg-
mentation and instance-aware edge detection the part grouping network (PGN), as
illustrated in Fig. 6.4.
Moreover, other proposal-free methods [3, 20, 21] break the task of instance object segmentation into several subtasks handled by a few separate networks and resort to complex postprocessing.

Fig. 6.2 Two examples show that the errors of the parts and edges of challenging cases can be
seamlessly remedied by the refinement scheme in PGN. In the first row, the segmentation branch
fails to locate the small objects (e.g., the person at the top-left corner and the hand at the bottom-right
corner), but the edge branch detects them successfully. In the second row, the background edges
are mistakenly labeled. However, these incorrect results are rectified by the refinement branch of
the PGN

In contrast, PGN seamlessly integrates part segmentation and edge detection under a unified network that first learns a shared representation and then
appends two parallel branches for semantic part segmentation and instance-aware
edge detection. As the two targets are highly correlated with each other by sharing
coherent grouping goals, PGN further incorporates a refinement branch to make the
two targets mutually benefit from each other by exploiting complementary contextual
information. This integrated refinement scheme is especially advantageous for chal-
lenging cases because it seamlessly remedies the errors from each target. As shown
in Fig. 6.2, a small person may fail to be localized by the segmentation branch but may be successfully detected by the edge branch, and background edges mistakenly labeled as instance boundaries can be corrected by the refinement scheme.
Given semantic part segmentation and instance edges, an efficient cutting inference
can be used to generate instance-level human parsing results using a breadth-first
search over line segments obtained by jointly scanning the segmentation and edge maps.
Furthermore, to the best of our knowledge, there is no available large-scale dataset
for instance-level human parsing research. We introduce a new large-scale dataset,
named Crowd Instance-level Human Parsing (CIHP), that contains 38,280 multiper-
son images with pixel-wise annotations of 19 semantic parts at the instance level. The
dataset is elaborately annotated, focusing on the semantic understanding of multiple
people in the wild, as shown in Fig. 6.1. With the new dataset, we also propose a public
server benchmark to automatically report evaluation results for fair comparison.
Our contributions are summarized as follows. (1) We investigate the more challenging instance-level human parsing task, which pushes the research boundary of human parsing to better match real-world scenarios. (2) A novel part grouping network (PGN) is proposed to directly solve multiperson human parsing in a unified network by
reformulating it as twinned grouping tasks, semantic part segmentation and instance-aware edge detection, that can be mutually refined. (3) We build a new large-scale
benchmark for instance-level human parsing and present a detailed dataset analysis.
(4) The PGN surpasses previous methods for both semantic part segmentation and
edge detection tasks and achieves state-of-the-art performance for instance-level
human parsing on both the existing PASCAL-Person-Part dataset [13] and our new
CIHP dataset.

6.2 Related Work

Human Parsing Recently, many research efforts have been devoted to human pars-
ing [7, 11, 22–24] to advance human-centric analysis. For example, Liang et al.
[24] proposed a novel Co-CNN architecture that integrates multiple levels of image
contexts into a unified network. Gong et al. [7] designed a structure-sensitive learn-
ing to enforce semantic consistency between the produced parsing results and the
human joint structures. However, all these prior works focus only on relatively sim-
ple single-person human parsing without considering the common multiple-instance
cases in the real world.
For the current data resources, we summarize the publicly available datasets for
human parsing in Table 6.1. Previous datasets include only very few person instances and categories per image, so prior work could evaluate only pure part segmentation performance while disregarding instance belongings. In contrast, contain-
ing 38,280 images, the proposed CIHP dataset is the first and most comprehensive
dataset for instance-level human parsing to date. Although a few datasets exist in
the vision community that are dedicated to other tasks, e.g., clothing recognition and
retrieval [25, 26] and fashion modeling [27], our CIHP, which focuses on instance-
level human parsing, is the largest and provides more elaborate dense annotations
for diverse images. A standard server benchmark for our CIHP can facilitate human
analysis research by enabling a fair comparison among current approaches.
Instance-Level Object Segmentation Our target is also highly relevant to the
instance-level object segmentation task that aims to predict a whole mask for each
object in an image. Most of the prior works [15–19] addressed this task by
sequentially optimizing object detection and foreground/background segmentation.
Dai et al. [17] proposed a multiple-stage cascade to unify bounding box proposal
generation, segment proposal generation, and classification. In [14, 28], a CRF was
used to assign each pixel to an object detection box by exploiting semantic segmen-
tation maps. More recently, Mask R-CNN [19] extended the Faster R-CNN detection
framework [29] by adding a branch to predict the segmentation masks of each region
of interest. However, these proposal-based methods may fail to model the interac-
tions among different instances, which are critical for performing more fine-grained
segmentation for each instance in our instance-level human parsing.
Nonetheless, some approaches [3, 20, 21, 30–32] were also proposed to bypass
the object proposal step for instance-level segmentation.

Table 6.1 Comparison of the publicly available datasets for human parsing. For each dataset, we
report the number of person instances per image; the total number of images; the separate number
of images in the training, validation, and test sets; and the number of part labels, including the
background
Dataset # Instances/image # Total # Train # Validation # Test Categories
Fashionista [23] 1 685 456 – 229 56
PASCAL-Person-Part [13] 2.2 3,533 1,716 – 1,817 7
ATR [5] 1 17,700 16,000 700 1,000 18
LIP [7] 1 50,462 30,462 10,000 10,000 20
CIHP 3.4 38,280 28,280 5,000 5,000 20

In the PFN [3], the number of instances and per-pixel bounding boxes were predicted and then clustered to produce instance segmentation. In [21], semantic segmentation and object boundary
prediction were exploited to separate instances by a complicated image partitioning
formulation. Similarly, the SGN [20] proposed predicting object breakpoints to cre-
ate line segments, which were then grouped into connected components to generate
object regions. Despite their intuition being similar to ours in grouping regions to
generate an instance, these two pipelines separately learn several subnetworks and
thus obtain the final results by relying on a few independent steps.
Here, we emphasize that this work investigates a more challenging fine-grained
instance-level human parsing task that integrates the current semantic part segmenta-
tion and instance-level object segmentation tasks. From the technical perspective, we
present a novel detection-free part grouping network that unifies and mutually refines
twinned grouping tasks, semantic part segmentation, and instance-aware edge detec-
tion, in an end-to-end way. Without the expensive CRF refinement used in [14], the
final results can then be effortlessly obtained by a simple instance partition process.

6.3 Crowd Instance-Level Human Parsing Dataset

To benchmark the more challenging multiperson human parsing task, we build a large-scale dataset called the Crowd Instance-level Human Parsing (CIHP) dataset,
which has several appealing properties. First, with 38,280 diverse human images, it is
the largest multiperson human parsing dataset to date. Second, CIHP is annotated with
rich information of person items. The images in this dataset are labeled with pixel-
wise annotations on 20 categories and instance-level identification. Third, the images
are collected from real-world scenarios and contain people appearing in challenging
poses, from various viewpoints, with heavy occlusions and various appearances and
in a wide range of resolutions. Some examples are shown in Fig. 6.1. With the CIHP
dataset, we propose a new benchmark for instance-level human parsing together with
a standard evaluation server, where the test set will be kept secret to avoid overfitting.

6.3.1 Image Annotation

The images in the CIHP are collected from unconstrained resources such as Google
and Bing. We manually specify several keywords (e.g., family, couple, party, meeting)
to gain a great diversity of multiperson images. The crawled images are elaborately
annotated by a professional labeling organization with good quality control. We
supervise the entire annotation process and conduct a second-round check for each
annotated image. We remove unusable images that have low resolution or poor image quality or that contain no more than one person instance.
In total, 38,280 images are kept to construct the CIHP dataset. Following random
selection, we arrive at a unique split that consists of 28,280 training and 5,000 vali-
dation images with publicly available annotations as well as 5,000 test images with
annotations withheld for benchmarking purposes.

6.3.2 Dataset Statistics

We now introduce the images and categories in the CIHP dataset with more statistical
details. Superior to the previous attempts [7, 13, 24], which average one or two person
instances per image, all images of the CIHP dataset contain two or more instances,
with an average of 3.4. The distribution of the number of persons per image is
illustrated in Fig. 6.3 (left).

Fig. 6.3 Left: Statistics on the number of persons in one image. Right: The data distribution of the
19 semantic part labels in the CIHP dataset

Generally, we follow LIP [7] to define and annotate the semantic part labels. However, we find that the jumpsuit label defined in LIP [7]
is infrequent compared to the other labels. For more complete and precise human
parsing, we use a more common body part label (torso-skin) instead. Therefore, the
19 semantic part labels in the CIHP are hat, hair, sunglasses, upper clothing, dress,
coat, socks, pants, gloves, scarf, skirt, torso-skin, face, right/left arm, right/left leg,
and right/left shoe. The numbers of images for each semantic part label are presented
in Fig. 6.3 (right).
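For convenience, the label set can be written as a simple index-to-name table. The ordering below (with index 0 reserved for background) is only an illustrative assumption; the actual indices are fixed by the released annotation files.

```python
# Illustrative label table for CIHP: 19 part labels plus background (20 categories).
# The index ordering is an assumption for illustration; the released annotations
# define the official mapping.
CIHP_LABELS = [
    "background", "hat", "hair", "sunglasses", "upper-clothes", "dress", "coat",
    "socks", "pants", "gloves", "scarf", "skirt", "torso-skin", "face",
    "right-arm", "left-arm", "right-leg", "left-leg", "right-shoe", "left-shoe",
]
assert len(CIHP_LABELS) == 20
```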

6.4 Part Grouping Network

In this section, we present a general pipeline for our approach (see Fig. 6.4) and
then describe each component in detail. The proposed part grouping network (PGN)
jointly trains and refines the semantic part segmentation and instance-aware edge
detection in a unified network. Technically, these two subtasks are both pixel-wise
classification problems, on which fully convolutional networks (FCNs) [4] perform
well. Our PGN is thus constructed based on the FCN structure, which first learns
common representation using shared intermediate layers and then appends two par-
allel branches for semantic part segmentation and edge detection. To explore and
take advantage of the semantic correlation of these two tasks, a refinement branch
is further incorporated to make the two targets mutually beneficial for each other
by exploiting complementary contextual information. Finally, an efficient partition
process with a heuristic grouping algorithm can be used to generate instance-level

Fig. 6.4 Illustration of our part grouping network (PGN). Given an input image, we use ResNet-
101 to extract the shared feature maps. Then, two branches are appended to capture part context and
human boundary context while simultaneously generating part score maps and edge score maps.
Finally, a refinement branch is performed to refine both predicted segmentation maps and edge
maps by integrating part segmentation and human boundary contexts

Fig. 6.5 The whole pipeline of our approach to instance-level human parsing. Generated from the
PGN, the part segmentation maps and edge maps are scanned simultaneously to create horizontal
and vertical segmented lines. Similar to a connected graph problem, the breadth-first search can
be applied to group the segmented lines into regions. Furthermore, the small regions near the
instance boundary are merged into their neighbor regions to cover larger areas and several part
labels. Associating the instance maps and part segmentation maps, the pipeline finally outputs a
well-predicted instance-level human parsing result without any proposals from object detection

human parsing results using a breadth-first search over line segments obtained by
jointly scanning the generated semantic part segmentation maps and instance-aware
edge maps.

6.4.1 PGN Architecture

Backbone Subnetwork. Basically, we use a repurposed ResNet-101 network, DeepLab-v2 [10], as our human feature encoder because of its demonstrated high
performance in dense prediction tasks. It employs convolution with upsampled fil-
ters, or “atrous convolution,” which effectively enlarges the field of view of the
filters to incorporate a larger context without increasing the number of parameters
or the amount of computation. The coupled problems of semantic segmentation and
edge detection share several key properties that can be efficiently learned by a few
shared convolutional layers. Intuitively, both require dense per-pixel recognition that draws on low-level contextual cues from nearby pixels and high-level semantic information for better localization. In this way, instead of training two separate net-
works to handle these two tasks, we create a single backbone network that allows
weight sharing for learning common feature representation.
However, in the original DeepLab-v2 architecture [10], an input image is down-
sampled by two different ratios (0.75 and 0.5) to produce multiscale inputs at three
different resolutions, which are independently processed by ResNet-101 using shared
weights. The output feature maps are then upsampled and combined by taking the
element-wise maximum. This multiscale scheme requires enormous memory and
is time consuming. Alternatively, we use single-scale input and employ two more
efficient and powerful coarse-to-fine schemes. First, inspired by skip architecture [4]
that combines semantic information from a deep, coarse layer with appearance infor-
mation from a shallow, fine layer to produce accurate and detailed segmentation,
we concatenate the activations of the final three blocks of ResNet-101 as the final
extracted feature maps. Owing to the atrous convolution, this information combi-
nation allows the network to make local predictions instructed by global structure
without upscale operation. Second, following PSPNet [33], which exploits the capa-
bility of global context information by different region-based context aggregation,
we use the pyramid pooling module on top of the extracted feature maps before the
final classification layers. The extracted feature maps are average-pooled with four
different kernel sizes, giving us four feature maps with spatial resolutions of 1 × 1,
2 × 2, 3 × 3, and 6 × 6. Each feature map undergoes convolution and upsampling
before they are concatenated with each other. Benefiting from these two coarse-to-
fine schemes, the backbone subnetwork is able to capture contextual information that
has different scales and varies among different subregions.
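The two coarse-to-fine schemes can be summarized in a short sketch. The code below is a hypothetical PyTorch-style rendering (the book's implementation extends TensorFlow, and the exact channel counts are assumptions); `block3_feat`, `block4_feat`, and `block5_feat` stand for the activations of the final three ResNet-101 blocks, which share one spatial resolution thanks to atrous convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """PSPNet-style context aggregation: average-pool the feature map on 1x1, 2x2,
    3x3 and 6x6 grids, convolve, upsample, and concatenate with the input."""
    def __init__(self, in_ch, out_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, out_ch, 1), nn.ReLU())
            for b in bins
        )

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(s(x), size=(h, w), mode="bilinear", align_corners=False)
                  for s in self.stages]
        return torch.cat([x] + pooled, dim=1)

def backbone_features(block3_feat, block4_feat, block5_feat):
    """Skip-style combination: concatenate the activations of the final three ResNet-101
    blocks; with atrous convolution no upscaling is needed before concatenation."""
    return torch.cat([block3_feat, block4_feat, block5_feat], dim=1)
```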
Semantic Part Segmentation Branch. The common technique [10, 34] for
semantic segmentation is to predict the image at several different scales with shared
network weights and then combine the predictions together with the learned attention
weights. To reinforce the efficiency and generalizability of our unified network, we
discard the multiscale input and apply another context aggregation pattern with var-
ious average-pooling kernel sizes, which is introduced in [33]. We append one side
branch to perform pixel-wise recognition for assigning each pixel to one semantic
part label. The 1 × 1 convolutional classifiers output K channels, corresponding to
the number of target part labels, including a background class.
Instance-Aware Edge Detection Branch. Following [35], we attach side out-
puts for edge detection to the final three blocks of ResNet-101. Deep supervision is
imposed at each side-output layer to learn rich hierarchical representations of edge
predictions. In particular, we use atrous spatial pyramid pooling (ASPP) [10] for the
three edge side output layers to robustly detect boundaries at multiple scales. The
ASPP that we use consists of one 1 × 1 convolution and four 3 × 3 atrous convolu-
tions with dilation rates of 2, 4, 8, and 16. In the final classification layers for edge
detection, we use a pyramid pooling module to collect more global information for
better reasoning. We apply 1 × 1 convolutional layers with one channel for all edge
outputs to generate edge score maps.
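The edge-branch ASPP (one 1 × 1 convolution plus four 3 × 3 atrous convolutions with dilation rates 2, 4, 8, and 16) followed by a one-channel classifier can be sketched as below; the channel widths are assumptions, and this is again a PyTorch-style illustration rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeASPP(nn.Module):
    """Atrous spatial pyramid pooling for an edge side output: one 1x1 conv plus four
    3x3 atrous convs with dilation rates 2, 4, 8, 16, then a one-channel edge classifier."""
    def __init__(self, in_ch, out_ch=256, rates=(2, 4, 8, 16)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1)]
            + [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r) for r in rates]
        )
        self.classifier = nn.Conv2d(out_ch * (1 + len(rates)), 1, kernel_size=1)

    def forward(self, x):
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.classifier(F.relu(feats))   # one-channel edge score map
```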
Refinement Branch. We design a simple yet efficient refinement branch to jointly
refine segmentation and edge predictions. As shown in Fig. 6.4, the refinement branch
integrates the segmentation and edge predictions back into the feature space by
mapping them to a larger number of channels with an additional 1 × 1 convolution.
The remapped feature maps are combined with the extracted feature maps from both
the segmentation branch and edge branch, which are finally fed into another two
pyramid pooling modules to mutually boost segmentation and edge results.
In summary, the learning objective of the PGN can be written as

L = α · (L_seg + L'_seg) + β · (L_edge + L'_edge + Σ_{n=1}^{N} L_side^n).    (6.1)

The resolution of the output score maps is m × m, which is the same for both segmentation and edges. Thus, the segmentation branch has a K m²-dimensional output, which encodes K segmentation maps of resolution m × m, one for each of the K classes. During training, we apply a per-pixel softmax and define L_seg as the multinomial cross-entropy loss; L'_seg is the same loss computed on the refined segmentation results. For each m²-dimensional edge output, we use a per-pixel sigmoid binary cross-entropy loss. L_edge, L'_edge, and L_side^n denote the losses of the first predicted edge, the refined edge, and the nth side-output edge, respectively. In our network, the number of edge side outputs, N, is 3. α and β are the balance weights.
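Under these definitions, the objective in Eq. (6.1) can be assembled as in the sketch below, where `seg_logits`/`seg_logits_refined` are K-channel score maps, the edge tensors are one-channel score maps, and the default weights follow the training settings reported in Sect. 6.5.1; this is an illustrative reading of the loss, not the authors' code.

```python
import torch
import torch.nn.functional as F

def pgn_loss(seg_logits, seg_logits_refined, edge_logits, edge_logits_refined,
             edge_side_logits, seg_target, edge_target, alpha=1.0, beta=0.01):
    """Eq. (6.1): multinomial cross-entropy for the K-class segmentation outputs and
    per-pixel sigmoid binary cross-entropy for the edge outputs, balanced by alpha/beta."""
    l_seg = F.cross_entropy(seg_logits, seg_target)              # first segmentation prediction
    l_seg_r = F.cross_entropy(seg_logits_refined, seg_target)    # refined segmentation prediction
    bce = F.binary_cross_entropy_with_logits
    l_edge = bce(edge_logits, edge_target)
    l_edge_r = bce(edge_logits_refined, edge_target)
    l_side = sum(bce(s, edge_target) for s in edge_side_logits)  # N = 3 side outputs
    return alpha * (l_seg + l_seg_r) + beta * (l_edge + l_edge_r + l_side)
```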
We use the batch normalization parameters provided by [10], which are fixed
during our training process. Our modules (including the ASPP and pyramid pooling
module) added to ResNet eliminate batch normalization because the whole network
is trained end-to-end with a small batch size due to the limitation of physical memory
on GPU cards. The ReLU activation function is applied following each convolutional
layer except the final classification layers.

6.4.2 Instance Partition Process

Because the tasks of semantic part segmentation and instance-aware edge detection
are able to incorporate all the information required for instance-level human parsing,
we employ a simple instance partition process to obtain the final results during
inference, which groups human parts into instances based on edge guidance. The
entire process is illustrated in Fig. 6.5.
First, inspired by the line decoding process in [20], we simultaneously scan part
segmentation maps and edge maps thinned by nonmaximal suppression [35] to create
horizontal and vertical line segments. To create horizontal lines, we slide from left to
right along each row. The background positions of the segmentation maps are directly
skipped, and a new line starts when we hit a foreground label of segmentation. The
lines are terminated when we hit an edge point, and a new line should start at the
next position. We label each new line with an individual number, so the edge points
can cut off the lines and produce a boundary between two different instances. We
perform similar operations but slide from top to bottom to create vertical lines.
The next step is to aggregate these two types of lines to create instances. We
can treat the horizontal lines and vertical lines jointly as a connected graph. The
points in the same lines can be thought of as connected because they have the same
labeled number. We traverse the connected graph by the breadth-first search to find
connected components. In detail, when visiting a point, we search its connected
neighbors horizontally and vertically and then push them into the queue that stores
the points belonging to the same regions. As a result, the lines of the same instance
are grouped, and different instance regions are separated.
This simple process inevitably introduces errors if there are false edge points
within instances, resulting in many small regions in the area around instance bound-
aries. We further design a grouping algorithm to address this issue. In rethinking the
separated regions, if a region contains several semantic part labels and covers a large
area, it must be a person instance. In contrast, if a region is small and contains only
one part segmentation label, we can certainly judge it to be an erroneously separated
region and then merge it with its neighbor instance region. We treat a region as a
person instance if it contains at least two part labels and covers an area over 30 pixels,
which works best in our experiments.
Following this instance partition process, person instance maps can be generated
directly from semantic part segmentation and instance-aware edge maps.
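The whole partition process can be sketched as follows. The code is a simplified stand-in for the horizontal/vertical line decoding: a 4-connected breadth-first search over foreground pixels that are not edge points, followed by the merging rule quoted above (a region is kept as an instance only if it covers more than 30 pixels and contains at least two part labels). Here `part_seg` is an integer label map with 0 as background and `edge_map` a boolean map of thinned edge points.

```python
import numpy as np
from collections import deque

def partition_instances(part_seg, edge_map, min_area=30, min_parts=2):
    """Group part-segmentation pixels into person instances, using instance-aware edge
    points as barriers, then merge spurious small regions into a neighboring instance."""
    h, w = part_seg.shape
    labels = np.zeros((h, w), dtype=np.int32)       # 0 = background / unassigned
    valid = (part_seg > 0) & (~edge_map)            # foreground pixels not cut by an edge
    next_id = 0
    for y in range(h):
        for x in range(w):
            if valid[y, x] and labels[y, x] == 0:
                next_id += 1
                labels[y, x] = next_id
                queue = deque([(y, x)])
                while queue:                        # breadth-first search over 4-neighbors
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and valid[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = next_id
                            queue.append((ny, nx))
    # Merge regions that are too small or cover a single part label into a neighbor region.
    for rid in range(1, next_id + 1):
        mask = labels == rid
        if mask.sum() <= min_area or len(np.unique(part_seg[mask])) < min_parts:
            grown = np.zeros_like(mask)
            grown[:-1] |= mask[1:]; grown[1:] |= mask[:-1]
            grown[:, :-1] |= mask[:, 1:]; grown[:, 1:] |= mask[:, :-1]
            neighbors = labels[grown & ~mask & (labels > 0)]
            if neighbors.size:
                labels[mask] = np.bincount(neighbors).argmax()
    return labels
```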

6.5 Experiments

6.5.1 Experimental Settings

Training Details: We use the basic structure and network settings provided by
DeepLab-v2 [10]. The 512 × 512 inputs are randomly cropped from the images
during training. The size of the output score maps, m, equals 64 with a downsampling scale of 1/8. The number of categories, K, is 7 for the PASCAL-Person-Part
dataset [13] and 20 for our CIHP dataset.
The initial learning rate is 0.0001, the parsing loss weight α is 1, and the edge
loss weight β is 0.01. Following [36], we employ a “poly” learning rate policy in which the initial learning rate is multiplied by (1 − iter/max_iter)^power with power = 0.9.
We train all models with a batch size of 4 images and momentum of 0.9.
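In sketch form, the "poly" policy is:

```python
def poly_learning_rate(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' policy: scale the base learning rate by (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# For example, halfway through training with the initial rate of 0.0001:
# poly_learning_rate(1e-4, 50000, 100000) = 1e-4 * 0.5 ** 0.9 ≈ 5.4e-5
```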
We apply data augmentation, including randomly scaling the input images (from
0.5 to 2.0), randomly cropping and randomly left-right flipping during training for
all datasets. As reported in [14], the baseline methods Holistic [14] and MNC [17]
are pretrained on the Pascal VOC Dataset [37]. For fair comparisons, we train the
PGN with the same settings for roughly 80 epochs.
Our method is implemented by extending the TensorFlow framework. All net-
works are trained on four NVIDIA GeForce GTX 1080 GPUs.
Inference: During testing, the resolution of every input is consistent with that of
the original image. We average the predictions produced by the part segmentation
branch and the refinement branch as the final results for part segmentation. For
edge detection, we use only the results of the refinement branch. To stabilize the
predictions, we perform inference by combining the results of the multiscale inputs
and left-right flipped images. In particular, the scale is 0.5 to 1.75 in increments
of 0.25 for segmentation and from 1.0 to 1.75 for edge detection. In the partition
process, we break the lines when the activation of the edge point is larger than 0.2.
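The multiscale, flip-averaged testing scheme can be sketched as below; `model` is a placeholder for any callable returning per-pixel score maps, the scale list follows the segmentation setting quoted above (edge detection would use 1.0–1.75), and this PyTorch-style code is an illustration rather than the TensorFlow implementation.

```python
import torch
import torch.nn.functional as F

def multiscale_flip_inference(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    """Average score maps over rescaled and left-right flipped inputs.
    `image` is a (1, 3, H, W) tensor; `model(x)` returns (1, C, h, w) score maps."""
    _, _, H, W = image.shape
    total = None
    with torch.no_grad():
        for s in scales:
            scaled = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
            for flip in (False, True):
                inp = torch.flip(scaled, dims=[3]) if flip else scaled
                out = model(inp)
                if flip:
                    out = torch.flip(out, dims=[3])
                out = F.interpolate(out, size=(H, W), mode="bilinear", align_corners=False)
                total = out if total is None else total + out
    return total / (2 * len(scales))
```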
Evaluation Metric: The standard intersection over union (IoU) criterion is adopted for the evaluation of semantic part segmentation, following [13]. To evaluate instance-aware edge detection performance, we use the same measures as traditional edge detection [38]: fixed contour threshold (ODS) and per-image best threshold (OIS).

Table 6.2 Comparison of instance-aware edge detection performance on the PASCAL-Person-Part dataset [13]
Method ODS OIS
RCF [38] 38.2 39.8
CEDN [39] 38.9 40.1
HED [35] 39.6 41.3
PGN (edge) 41.8 43.0
PGN (w/o refinement) 42.1 43.5
PGN 42.5 43.9

Table 6.3 Comparison of AP^r at various IoU thresholds for instance-level human parsing on the PASCAL-Person-Part dataset [13]
Method AP^r@0.5 AP^r@0.6 AP^r@0.7 AP^r_vol
MNC [17] 38.8 28.1 19.3 36.7
Holistic [14] 40.6 30.4 19.1 38.4
PGN (edge + segmentation) 36.2 25.9 16.3 35.6
PGN (w/o refinement) 39.1 29.3 19.5 37.8
PGN (w/o grouping) 37.1 28.2 19.3 38.2
PGN (large-area grouping) 37.6 28.7 19.7 38.6
PGN 39.6 29.9 20.0 39.2

In terms of instance-level human parsing, we define the metrics by drawing inspiration from the evaluation of instance-level semantic segmentation. Specifically, we adopt mean average precision, referred to as AP^r [15]. We also report the mean of the AP^r scores for overlap (IoU) thresholds varying from 0.1 to 0.9 in increments of 0.1, denoted AP^r_vol [14].
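As a rough illustration of AP^r, the sketch below greedily matches score-ranked predicted instance masks to ground-truth instances by mask IoU at one threshold and integrates the precision–recall curve; it omits the per-category handling and other details of the official evaluation protocol.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean instance masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def average_precision_r(pred_masks, pred_scores, gt_masks, iou_thresh=0.5):
    """Simplified AP^r at one IoU threshold: rank predictions by score, greedily match
    each to an unused ground-truth instance, then integrate precision over recall."""
    order = np.argsort(pred_scores)[::-1]
    matched, tp = set(), []
    for i in order:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_masks):
            if j not in matched:
                iou = mask_iou(pred_masks[i], gt)
                if iou > best_iou:
                    best_iou, best_j = iou, j
        if best_iou >= iou_thresh:
            matched.add(best_j)
            tp.append(1)
        else:
            tp.append(0)
    tp = np.array(tp, dtype=float)
    precision = np.cumsum(tp) / (np.arange(len(tp)) + 1)
    recall = np.cumsum(tp) / max(len(gt_masks), 1)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):   # step-wise area under the PR curve
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# AP^r_vol is then the mean of AP^r over IoU thresholds 0.1, 0.2, ..., 0.9.
```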

6.5.2 PASCAL-Person-Part Dataset

We first evaluate the performance of our PGN on the PASCAL-Person-Part dataset [13], with 1,716 images for training and 1,817 for testing. Following [34, 40],
the annotations are merged to include six person part classes, head, torso, upper/lower
arms, and upper/lower legs, and one background class.
Comparison on Instance-aware Edge Detection We report the statistical com-
parison of our PGN and state-of-the-art methods on instance-aware edge detection
in Table 6.2. Our PGN shows a substantial improvement in terms of ODS and OIS.
This large improvement demonstrates that edge detection can benefit from semantic
part segmentation in our unified network.

Table 6.4 Performance comparison of edges (left), part segmentation (middle), and instance-level
human parsing (right) from different components of the PGN on the CIHP
Method ODS OIS Mean IoU AP^r@0.5 AP^r@0.6 AP^r@0.7 AP^r_vol
PGN (edge) + PGN (segmentation) 44.8 44.9 50.7 28.5 22.9 16.4 27.8
PGN (w/o refinement) 45.3 45.6 54.1 33.3 26.3 18.5 31.4
PGN (w/o grouping) – – – 34.7 27.8 20.1 32.9
PGN (large-area grouping) – – – 35.1 28.2 20.4 33.4
PGN 45.5 46.0 55.8 35.8 28.6 20.5 33.6

Comparison on Instance-level Human Parsing Table 6.3 shows the results of a comparison of instance-level human parsing with two baseline methods [14, 17] that
rely on the object detection framework to generate a large number of proposals for
separating instances. Our PGN method achieves state-of-the-art performance, espe-
cially in terms of high IoU threshold, owing to the smoother segmentation boundaries
refined by edge context. This result verifies the rationality of our PGN based on the
assumption that semantic part segmentation and edge detection together can directly
depict the key characteristics for achieving good capability in instance-level human
parsing. The joint feature learning scheme in the PGN also enables the part-level
grouping by semantic part segmentation and instance-level grouping by instance-
aware edge detection to mutually benefit from each other by seamlessly incorporating
multilevel contextual information.

6.5.3 CIHP Dataset

As there are no available codes for the baseline methods [14], we extensively evaluate
each component of our PGN architecture on the CIHP test set, as shown in Table 6.4.
For part segmentation and instance-level human parsing, the performance on CIHP is worse than that on PASCAL-Person-Part [13] because the CIHP dataset contains
more instances with more diverse poses, appearance patterns, and occlusions, which
is more consistent with real-world scenarios, as shown in Fig. 6.6. However, the
images in CIHP are high quality with higher resolution, which leads to better edge
detection results.

6.5.4 Qualitative Results

The qualitative results on the PASCAL-Person-Part dataset [13] and the CIHP dataset
are shown in Fig. 6.6.

Fig. 6.6 Left: Visualized results on the PASCAL-Person-Part dataset [13]. In each group, the first
line shows the input image, segmentation and instance results of Holistic [14] (provided by the
authors), and the results of our PGN are presented in the second line. Right: The images and the
predicted results of edges, segmentation, and instance-level human parsing by our PGN on the CIHP
dataset are presented vertically

Compared to the results of Holistic [14], our part segmentation and instance-level human parsing results are more precise because the predicted edges
can eliminate the interference from the background, such as the flag in group (a) and
the dog in group (b). Overall, our PGN outputs highly semantically meaningful
predictions owing to the mutual refinement of edge detection and semantic part
segmentation.

References

1. L. Wang, X. Ji, Q. Deng, M. Jia, Deformable part model based multiple pedestrian detection
for video surveillance in crowded scenes, in VISAPP (2014)
2. K. Gong, X. Liang, D. Zhang, X. Shen, L. Lin, Look into person: Self-supervised structure-
sensitive learning and a new benchmark for human parsing, in CVPR (2017)
3. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in CVPR (2016)
4. Q. Li, A. Arnab, P.H. Torr, Holistic, instance-level human parsing. arXiv preprint
arXiv:1709.03612 (2017)
5. B. Hariharan, P. Arbeláez, R. Girshick, J. Malik, Simultaneous detection and segmentation, in
ECCV (2014)
6. X. Liang, Y. Wei, X. Shen, Z. Jie, J. Feng, L. Lin, S. Yan, Reversible recursive instance-level
object segmentation, in CVPR (2016)
7. J. Dai, K. He, J. Sun, Instance-aware semantic segmentation via multi-task network cascades,
in CVPR (2016)
8. P.O. Pinheiro, R. Collobert, P. Dollár, Learning to segment object candidates, in NIPS (2015)
9. K. He, G. Gkioxari, P. Dollar, R. Girshick, Mask r-cnn, in ICCV (2017)
10. S. Liu, J. Jia, S. Fidler, R. Urtasun, Sgn: Sequential grouping networks for instance segmenta-
tion, in ICCV (2017)
11. A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, C. Rother, Instancecut: from edges to
instances with multicut, in CVPR (2017)

12. E. Simo-Serra, S. Fidler, F. Moreno-Noguer, R. Urtasun, A High Performance CRF Model for
Clothes Parsing, in ACCV (2014)
13. Z. Liu, P. Luo, S. Qiu, X. Wang, X. Tang, Deepfashion: Powering robust clothes recognition
and retrieval with rich annotations, in CVPR (2016)
14. M. Hadi Kiapour, X. Han, S. Lazebnik, A.C. Berg, T.L. Berg, Where to buy it: matching street clothing photos in online shops, in ICCV (2015)
15. E. Simo-Serra, S. Fidler, F. Moreno-Noguer, R. Urtasun, Neuroaesthetics in fashion: modeling the perception of fashionability, in CVPR (2015)
16. A. Arnab, P.H.S. Torr, Pixelwise instance segmentation with a dynamically instantiated net-
work, in CVPR (2017)
17. S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region
proposal networks, in NIPS (2015)
18. M. Ren, R.S. Zemel, End-to-end instance segmentation with recurrent attention, in CVPR
(2017)
19. M. Bai, R. Urtasun, Deep watershed transform for instance segmentation, in CVPR (2017)
20. B. Romera-Paredes, P.H.S. Torr, Recurrent instance segmentation, in ECCV (2016)
21. H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, in CVPR (2017)
22. M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The pascal visual object
classes (voc) challenge, in IJCV (2010)
23. S. Xie, Z. Tu, Holistically-nested edge detection, in ICCV (2015)
24. L.C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic
image segmentation. arXiv preprint arXiv:1706.05587 (2017)
25. Y. Liu, M.M. Cheng, X. Hu, K. Wang, X. Bai, Richer convolutional features for edge detection,
in CVPR (2017)
26. J. Yang, B. Price, S. Cohen, H. Lee, M.H. Yang, Object contour detection with a fully convo-
lutional encoder-decoder network, in CVPR (2016)
27. C. Gan, M. Lin, Y. Yang, G. de Melo, A.G. Hauptmann, Concepts not alone: Exploring pairwise
relationships for zero-shot video activity recognition, in AAAI (2016)
28. X. Liang, Y. Wei, X. Shen, J. Yang, L. Lin, S. Yan, Proposal-free network for instance-level
object segmentation. arXiv preprint arXiv:1509.02636 (2015)
29. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation,
arXiv preprint arXiv:1411.4038 (2014)
30. H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, O. Russakovsky, J. Deng, L. Fei-Fei, ImageNet large scale visual recognition challenge. IJCV (2015)
31. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. Torr,
Conditional random fields as recurrent neural networks, in ICCV (2015)
32. L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915 (2016)
33. K. Yamaguchi, M. Kiapour, T. Berg, Paper doll parsing: Retrieving similar styles to parse
clothing items, in ICCV (2013)
34. J. Dong, Q. Chen, W. Xia, Z. Huang, S. Yan, A deformable mixture parsing model with parselets,
in ICCV (2013)
35. X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, et al., Detect what you can: Detecting and
representing objects using holistic models and body parts, in CVPR (2014)
36. T. Lin, M. Maire, S.J. Belongie, L.D. Bourdev, R.B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: common objects in context. CoRR arXiv:1405.0312 (2014)
37. K. Yamaguchi, M. Kiapour, L. Ortiz, T. Berg, Parsing clothing in fashion photographs, in CVPR
(2012)
38. X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, S. Yan, Human parsing with
contextualized convolutional neural network, in ICCV (2015)
39. L.C. Chen, Y. Yang, J. Wang, W. Xu, A.L. Yuille, Attention to scale: Scale-aware semantic
image segmentation, in CVPR (2016)
40. F. Xia, P. Wang, L.-C. Chen, A.L. Yuille, Zoom better to see clearer: human part segmentation with auto zoom net, in ECCV (2016)
Chapter 7
Video Instance-Level Human Parsing

Abstract This chapter introduces a novel Adaptive Temporal Encoding Network (ATEN) that alternately performs temporal encoding among key frames and flow-guided feature propagation for the consecutive frames between two key frames. Specifically, ATEN first incorporates a Parsing-RCNN to produce the instance-level parsing result for each key frame, which integrates global human parsing and instance-level human segmentation into a unified model. To balance accuracy and efficiency, flow-guided feature propagation is used to directly parse consecutive frames according to their identified temporal consistency with key frames. On the other hand, ATEN leverages convolutional gated recurrent units (convGRU) to exploit temporal changes over a series of key frames, which are further used to facilitate frame-level instance-level parsing. By alternately performing direct feature propagation between consistent frames and temporal encoding among key frames, our ATEN achieves a good balance between frame-level accuracy and time efficiency, which is a crucial problem in video object segmentation research.

7.1 Introduction

Due to the successful development of fully convolutional networks (FCNs) [1], great
progress has been made in human parsing, or the semantic part segmentation task [2–
8]. However, previous approaches to single-person or multiple-person human parsing
focused only on the static image domain. To bring the research closer to real-world
scenarios, fast and accurate video instance-level human parsing is more desirable
and crucial for high-level applications such as action recognition and object tracking
as well as group behavior prediction.
In this work, we make the first attempt to investigate the more challenging video
instance-level human parsing task, which needs to not only segment various body
parts or clothing but also associate each part with one instance for every frame in the
video, as shown in Fig. 7.1. In addition to the difficulties shared with single-person
parsing (e.g., various appearances, viewpoints, and self-occlusions) and instance-
level parsing (e.g., an uncertain number of instances), video human parsing faces
more challenges that are inevitable in video object detection and segmentation prob-
lems. For example, recognition accuracy suffers from deteriorated appearances in videos that are seldom observed in still images, such as motion blur and video defocus. On the other hand, the balance between frame-level accuracy and time efficiency
cus. On the other hand, the balance between frame-level accuracy and time efficiency
is also a very difficult and important factor in the deployment of diverse devices (such
as mobile devices).

7.2 Video Instance-Level Parsing Dataset

In this section, we describe our new video instance-level parsing (VIP) dataset in
detail. Sample frames of some of the sequences are shown in Fig. 7.1. To the best of
our knowledge, our VIP is the first large-scale dataset that focuses on comprehensive
human understanding to benchmark the new challenging video instance-level fine-
grained human parsing task. Containing videos collected from real-world scenarios
in which people appear in various poses, from various viewpoints and with heavy
occlusions, the VIP dataset presents the difficulties of the semantic part segmentation
task. Furthermore, it also includes all major challenges typically found in longer video
sequences such as motion blur, camera shake, out-of-view, and scale variation.

Fig. 7.1 Sample sequences from our VIP dataset with ground-truth part segmentation masks over-
laid

7.2.1 Data Amount and Quality

Our data collection and annotation methodology are carefully designed to capture
the high variability of real-world human activity scenes. The sequences are collected
from YouTube with several specified keywords (e.g., dancing, flash mob) to gain
a wide variety of multiperson videos. All images are meticulously annotated by
professionals. We maintain data quality by manually inspecting and conducting a
second-round check of the annotated data. We remove unusable images that have low resolution or poor image quality. The length of a video in the dataset ranges from
10 s to 120 s. For every 25 consecutive frames in each video, one frame is densely
annotated with pixel-wise semantic part categories and instance-level identification.

7.2.2 Dataset Statistics

To analyze every detailed region of a person, including different body parts as well
as different clothing styles, following the largest still-image human parsing LIP
dataset [8], we defined 19 usual clothing classes and body parts, hat, hair, sunglasses,
upper clothing, dress, coat, socks, pants, gloves, scarf, skirt, torso-skin, face, right/left
arm, right/left leg, and right/left shoe, for annotation. Additionally, the annotated
frames of our VIP dataset, with an average of 2.93 person instances per image, are
superior to the previous attempts [3, 8, 9], which average one or two person instances
per image.

7.3 Adaptive Temporal Encoding Network

Given a video frame sequence I_j, j = 1, 2, 3, ..., N, video instance-level human parsing involves outputting each person instance and parsing each instance into more fine-grained parts (e.g., head, leg, dress) for all frames. A baseline approach to solving this problem is to apply an image instance-level human parsing method to each frame individually, which is simple but performs poorly (in both efficiency and accuracy) because of the lack of temporal information. First, as a baseline, we propose a novel Parsing-RCNN to produce instance-level parsing results for each key frame, which integrates global human parsing and instance-level human segmentation into a unified model. In the Parsing-RCNN, a deep fully convolutional network (FCN) is applied to the input image I to generate feature maps F = N_feat(I). Subsequently, a well-designed instance-level human parsing subnetwork N_parse is applied to the extracted features to produce global human parsing as well as instance-level human segmentation and to generate the final instance-level human parsing results R = N_parse(F) by taking the union of all parts assigned to a particular instance.

Fig. 7.2 An overview of our ATEN approach, which performs adaptive temporal encoding over
key frames and flow-guided feature propagation for consecutive frames among key frames. Each
key frame (blue) is fed into a temporal encoding module that memorizes the temporal information
of its former key frames. To alleviate the computational cost, the features of consecutive frames
(green) between two key frames can be produced by the flow-guided propagation module from the
nearest key frame. Then, all feature maps of all frames are fed to the Parsing-RCNN to generate
the instance-level human parsing results

As shown in Fig. 7.2, our ATEN approach based on the Parsing-RCNN aims to balance efficiency and accuracy by applying flow-guided feature propagation and adaptive temporal encoding. We divide each video sequence into several segments of equal length l, Seg = [I_{jl}, I_{jl+1}, ..., I_{(j+1)l−1}]. Only one frame in each segment is selected to be a key frame (using the median frame as the default). Given a key frame I_k, the encoded feature is denoted as

F̄_k = ε(F_{k−2}, F_{k−1}, F_k).    (7.1)

Subsequently, the feature of a non-key frame I_t is propagated from the nearest key frame I_k, which is denoted as

F̄_t = W(F̄_k, M_{t→k}, S_{t→k}),    (7.2)

where M and S are the flow field and scale field, respectively. Finally, the instance-level human parsing subnetwork N_parse is applied to both the encoded key-frame feature maps and the warped non-key-frame feature maps to compute the eventual result R = N_parse(F̄).
As shown in Fig. 7.3, given an encoding range p, which specifies the range of the former key frames used for encoding (p = 2 by default), we first apply the embedded FlowNet F [10] to individually estimate p flow fields and scale fields,

Fig. 7.3 Our adaptive temporal encoding module. For each key frame K, we first obtain warped feature maps from two previous key frames (i.e., K − 1 and K − 2) via the flow-guided propagation module. Then, the warped features and the current appearance features are consecutively fed to convGRU for temporal encoding. All feature maps in this module have the same shape (stride of 4, 256 dimensions)

which are used for warping (as described in Sect. 7.3.1) the p former key frames to the current key frame:

F_{k−j→k} = W(F_{k−j}, M_{k→k−j}, S_{k→k−j}),  j ∈ [1, p].    (7.3)

After feature warping, each warped feature is consecutively fed to convGRU for temporal coherence feature encoding. We use the last state of the GRU as the encoded feature:

F̄_k = convGRU(F_{k−p→k}, ..., F_{k−1→k}, F_k).    (7.4)

ConvGRU is an extension of the traditional GRU [11] that has convolutional structures instead of fully connected structures. Equation (7.5) illustrates the operations inside a GRU unit. The new state h_t is a weighted combination of the previous state h_{t−1} and the candidate memory h̃_t. The update gate z_t determines how much of this memory is incorporated into the new state. The reset gate r_t controls the influence of the previous state h_{t−1} on the candidate memory h̃_t.

z_t = σ(x_t ∗ w_xz + h_{t−1} ∗ w_hz + b_z),
r_t = σ(x_t ∗ w_xr + h_{t−1} ∗ w_hr + b_r),
h̃_t = tanh(x_t ∗ w_xh̃ + (r_t ⊙ h_{t−1}) ∗ w_hh̃ + b_h̃),    (7.5)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t.

In contrast to the traditional GRU, ∗ here represents a convolutional operation, and ⊙ denotes element-wise multiplication. σ is the sigmoid function, the w terms are learned transformations, and the b terms are biases.
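Equation (7.5) corresponds to a convolutional GRU cell along the lines of the sketch below. This is a PyTorch-style illustration under the assumption of 256-dimensional features as in Fig. 7.3 (the implementation details in the original framework may differ); a single convolution over the concatenated input and state replaces the two separate weight terms per gate, which is equivalent for linear operations.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell following Eq. (7.5): the update gate z_t, reset gate r_t,
    and candidate memory are produced by convolutions instead of fully connected layers."""
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One conv over [x_t, h_{t-1}] per gate is equivalent to separate w_x* and w_h* terms.
        self.conv_z = nn.Conv2d(in_ch + hidden_ch, hidden_ch, kernel_size, padding=pad)
        self.conv_r = nn.Conv2d(in_ch + hidden_ch, hidden_ch, kernel_size, padding=pad)
        self.conv_h = nn.Conv2d(in_ch + hidden_ch, hidden_ch, kernel_size, padding=pad)

    def forward(self, x_t, h_prev):
        xh = torch.cat([x_t, h_prev], dim=1)
        z_t = torch.sigmoid(self.conv_z(xh))                                  # update gate
        r_t = torch.sigmoid(self.conv_r(xh))                                  # reset gate
        h_cand = torch.tanh(self.conv_h(torch.cat([x_t, r_t * h_prev], dim=1)))
        return (1 - z_t) * h_prev + z_t * h_cand                              # new state h_t

def encode_key_frame(cell, warped_feats, f_k):
    """Eq. (7.4): feed the warped former key-frame features and the current key-frame
    feature through the cell in order and keep the last state as the encoded feature."""
    h = f_k.new_zeros(f_k.size(0), cell.conv_z.out_channels, f_k.size(2), f_k.size(3))
    for f in list(warped_feats) + [f_k]:
        h = cell(f, h)
    return h
```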

7.3.1 Flow-Guided Feature Propagation

Motivated by [12, 13], given a reference frame I_j and a target frame I_i, an optical flow field is calculated by the embedded FlowNet F [10, 14] to obtain a pixel-wise motion path. Extending the FlowNet with a scale field that has the same spatial and channel dimensions as the feature maps helps to improve the flow warping accuracy. The feature propagation function is defined as

F_{j→i} = W(F_j, F(I_i, I_j)) ⊙ S,    (7.6)

where F_j denotes the deep feature of the reference frame I_j, W denotes the bilinear sampler function, ⊙ denotes element-wise multiplication, F represents the flow estimation function, and S is the scale field that refines the warped feature.
adopted as the flow estimation function and pretrained on the FlyingChairs dataset.
A scale map with the same dimensions as the target features is predicted in parallel
with the flow field by FlowNet via an additional 1×1 convolutional layer attached
to the top feature of the flow network. The weights of the extra 1×1 convolutional
layer are initialized with zeros. The biases are initialized with ones and frozen during
the training phase. The whole process is fully differentiable, which has been clearly
described in [12, 13].
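The propagation function W in Eq. (7.6), bilinear warping along the estimated flow followed by element-wise scaling, can be sketched with grid sampling. The sketch below assumes the flow field is given at the same resolution as the feature map and expressed in pixels; it is an illustration, not the original implementation.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat, flow, scale_field):
    """W(F_j, flow) ⊙ S: bilinearly sample the reference feature map `feat` (N, C, H, W)
    along the flow field `flow` (N, 2, H, W, in pixels) and multiply element-wise by the
    predicted scale field `scale_field` (broadcastable to feat's shape)."""
    n, _, h, w = feat.shape
    ys = torch.arange(h, dtype=feat.dtype, device=feat.device).view(1, h, 1)
    xs = torch.arange(w, dtype=feat.dtype, device=feat.device).view(1, 1, w)
    grid_x = xs + flow[:, 0]                      # source x-coordinate for each target pixel
    grid_y = ys + flow[:, 1]                      # source y-coordinate for each target pixel
    # grid_sample expects coordinates normalized to [-1, 1].
    grid = torch.stack([2.0 * grid_x / (w - 1) - 1.0,
                        2.0 * grid_y / (h - 1) - 1.0], dim=-1)
    warped = F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
    return warped * scale_field
```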

7.3.2 Parsing R-CNN

Following Mask R-CNN [15], a simple and effective framework for image instance segmentation, we extend it to a Parsing-RCNN by adding a global human parsing branch that predicts semantic fine-grained part segmentation in parallel with the original Mask R-CNN branches.
As shown in Fig. 7.4, given a feature map F generated by a fully convolutional
network [1, 2, 16, 17], the entire process is as follows. On the one hand, the instance-
level human segmentation branch integrates a region proposal network (RPN) and
applies it on F to propose candidate object bounding boxes and ROIAlign to extract
region-of-interest (ROI) features and perform classification, bounding-box regres-
sion, and binary mask estimation. On the other hand, in the global human parsing
branch, we apply multirate atrous convolution on F to predict semantic fine-grained part segmentation, as in DeepLab [2]. Taking these two results (human instance
segmentation and semantic fine-grained part segmentation) into consideration, we
can easily obtain instance-level human parsing results.
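A minimal sketch of this final combination (assuming boolean instance masks from the instance branch and an integer part-label map from the global parsing branch, with 0 denoting background) is:

```python
import numpy as np

def combine_instances_and_parsing(instance_masks, part_seg):
    """For each detected human instance (boolean mask), keep the global part labels
    that fall inside its mask, yielding one part-label map per person instance."""
    results = []
    for mask in instance_masks:
        results.append(np.where(mask, part_seg, 0))
    return results
```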

Fig. 7.4 Our Parsing-RCNN module for instance-level human parsing. Feature maps extracted by
the backbone network are simultaneously passed through the instance-level human segmentation
branch and global human parsing branch, and the results are then integrated to obtain the final
instance-level human parsing results by taking the union of all parts assigned to a particular human
instance

Formally, during training, we define a multitask loss on both the whole image and each ROI as

L = L_parsing + L_cls + L_box + L_mask.    (7.7)

L_parsing is the image-level global parsing loss, which is defined as a softmax cross-entropy loss. L_cls, L_box, and L_mask are calculated on each ROI. The global parsing branch and the instance-level human segmentation branch are jointly trained to minimize L by stochastic gradient descent (SGD).

7.3.3 Training and Inference

Training. Our ATEN is fully differentiable and can be trained end-to-end. A standard image-domain method can be transferred to video tasks by selecting a proper task-specific subnetwork. During the training phase, in each minibatch, video frames {I_{k−p}, ..., I_k, I_t}, −l/2 ≤ t − k < l − l/2, are randomly sampled and fed to the network. In the forward pass, N_feat is applied to all frames except I_t. After the encoded feature F̄_k is obtained, it is propagated to F̄_t; when t = k, the feature maps are identical and are passed through N_parse directly. Finally, N_parse is applied to F̄_t or F̄_k. Because all the components are differentiable, the multitask loss, as illustrated in Eq. (7.7), can backpropagate to all subnetworks to optimize task performance.
Inference. Algorithm 7.1 summarizes the inference procedure. Given a video frame sequence I, a segment length l, and an encoding range p, the proposed method sequentially processes each segment. Only one frame is selected as the key frame in each segment. A fully convolutional network is applied to the key frame I_k to extract the feature F_k. Then, the p former key frames are found and fed into the adaptive temporal encoding module together with the current key frame. When there are not enough former key frames, the p latter key frames are selected instead. Subsequently, these key frames are warped to the current key frame via the flow-guided propagation module and consecutively fed to convGRU for temporal coherence feature encoding. With the encoded feature F̄_k, the features F̄_t of the other (non-key) frames in this segment can be obtained by the flow-guided feature propagation module. Finally, the Parsing-RCNN module is applied to F̄_k or F̄_t to obtain instance-level parsing results.
Regarding runtime complexity, the ratio of our method versus the single-frame baseline is as follows:

r = [O(GRU) + (l + p) × (O(W) + O(F)) + l × O(N_parse) + O(N_feat)] / [l × (O(N_feat) + O(N_parse))],    (7.8)

where O(·) measures the function complexity. In each segment of length l, compared with the frame-level baseline, which invokes N_feat and N_parse l times each, our method invokes the costly N_feat only once. As both N_feat and F have considerable complexity, we have O(GRU), O(W), O(N_parse) ≪ O(F) < O(N_feat).

Algorithm 7.1: Inference algorithm of the adaptive temporal encoding network

Input: video frame sequence {I}, key frame duration length l, encoding range p
for k in [1, N] do
    F_k = N_feat(I_k)
end
for k in [1, N] do
    for i in [1, p] do
        F_{k−i→k} = W(F_{k−i}, F(I_k, I_{k−i})) ⊙ S
    end
    F̄_k = convGRU(F_{k−p→k}, ..., F_{k−1→k}, F_k)
    for j in [−l/2, l − l/2) do
        if j = 0 then
            r_k = N_parse(F̄_k)
        else
            F̄_{k+j} = W(F̄_k, F(I_{k+j}, I_k)) ⊙ S
            r_{k+j} = N_parse(F̄_{k+j})    /* task-specific subnetwork */
        end
    end
end
Output: instance-level human parsing results {r}

Thus, the ratio in Eq. (7.8) is approximated as

r ≈ (l + p) × O(F) / (l × O(N_feat)) + 1/l < 1.    (7.9)

In practice, the encoding range $p$ is small (e.g., 1 or 2), and the backbone fully convolutional network has a much higher time complexity than FlowNet. Consequently, our approach achieves a faster speed than the per-frame baseline while maintaining high accuracy.
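As a quick worked example of Eq. (7.9), suppose the flow network costs about 30% of the backbone (an assumed ratio for illustration, not a measured value); with $l = 5$ and $p = 2$, the per-frame cost ratio is roughly 0.62, i.e., about a 1.6x speedup:

```python
# Illustrative evaluation of Eq. (7.9); the 0.3 FlowNet/backbone cost ratio is assumed.
l, p = 5, 2                  # segment length and encoding range
flow_vs_backbone = 0.3       # assumed O(F) / O(N_feat)
r = (l + p) * flow_vs_backbone / l + 1.0 / l
print(f"cost ratio r = {r:.2f}, speedup = {1 / r:.2f}x")   # r = 0.62, ~1.61x
```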

References

1. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, arXiv preprint arXiv:1411.4038 (2014)
2. L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915 (2016)
3. X. Liang, C. Xu, X. Shen, J. Yang, S. Liu, J. Tang, L. Lin, S. Yan, Human parsing with
contextualized convolutional neural network, in ICCV (2015)
4. X. Liang, X. Shen, J. Feng, L. Lin, S. Yan, Semantic object parsing with graph lstm, in ECCV
(2016)
5. X. Liang, H. Zhou, E. Xing, Dynamic-structured semantic propagation network, in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 752–761 (2018)
6. X. Liang, S. Liu, X. Shen, J. Yang, L. Liu, J. Dong, L. Lin, S. Yan, Deep human parsing with
active template regression. TPAMI (2015)
7. S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. Torr,
Conditional random fields as recurrent neural networks, in ICCV (2015)
8. K. Gong, X. Liang, D. Zhang, X. Shen, L. Lin, Look into person: self-supervised structure-
sensitive learning and a new benchmark for human parsing, in CVPR (2017)
9. X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, et al., Detect what you can: Detecting and
representing objects using holistic models and body parts, in CVPR (2014)
10. A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox, Flownet: Learning optical flow with convolutional networks, in Proceedings of the IEEE International Conference on Computer Vision, pp. 2758–2766 (2015)
11. K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014)
12. X. Zhu, Y. Wang, J. Dai, L. Yuan, Y. Wei, Flow-guided feature aggregation for video object
detection, in Proceedings of the IEEE International Conference on Computer Vision, pp. 408–
417 (2017)
13. X. Zhu, Y. Xiong, J. Dai, L. Yuan, Y. Wei, Deep feature flow for video recognition, in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2349–2358
(2017)
14. E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, T. Brox, FlowNet 2.0: Evolution of
optical flow estimation with deep networks, in IEEE Conference on Computer Vision and
Pattern Recognition, pp. 1647–1655 (2017)
15. K. He, G. Gkioxari, P. Dollar, R. Girshick, Mask r-cnn, in ICCV (2017)
16. L.C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-Decoder with Atrous Sepa-
rable Convolution for Semantic Image Segmentation (2018)
17. L.C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic
image segmentation. arXiv preprint arXiv:1706.05587 (2017)
Part IV
Identifying and Verifying Persons

Person verification involves person reidentification and face recognition (in this
part, we focus on face verification in different modalities, i.e., faces from still
images and videos, older and younger faces, and sketch and photo portraits).
Person reidentification (ReID), which aims to match pedestrian images across
multiple nonoverlapping cameras, has attracted increasing attention in surveillance.
Most recent works can be categorized into three groups: (1) extracting invariant and
discriminant features [1–4], (2) learning a robust metric or subspace for matching
[1, 5–8], and (3) joint learning of the above two methods [9–11]. Recently, deep
learning [4] and video-based models [12] have also been introduced for ReID. There
are also works on the generalization of ReID, e.g., [13, 14]. Recently, GAN [15] was
also introduced to boost the performance of ReID. Zheng et al. [16] adopt a DCGAN to generate unlabeled data and effectively improve the discriminative ability of the baseline. Zhong et al. [17] propose two camera-style adaptation methods for same-source mapping and unsupervised domain adaptation. Deng et al. [18] introduce a similarity-preserving GAN (SPGAN) to learn image translation from the source to the target domain in an unsupervised manner. Despite considerable efforts, ReID is still an open problem due to dramatic variations in viewpoint and pose.
Despite the great advances in face-related research in recent years, face recognition
across age remains a challenging problem. The challenges include large intrasubject
variation and great intersubject similarity [19]. The human facial appearance changes
greatly with the aging process. From birth to adulthood, the greatest change is cran-
iofacial growth, which involves a change in shape; from adulthood to old age, the
most perceptible change is skin aging, which involves a texture change [20]. Such
changes in the same person are intrasubject variations. Meanwhile, different persons
in the same age period may look similar, which is intersubject similarity. Therefore,
reducing intrasubject variations while increasing intersubject differences is a crucial
goal in metric-based age-invariant recognition. Several traditional approaches, such
as linear discriminant analysis (LDA) [21], Bayesian face recognition [22, 23], met-
ric learning [24], and recent deep learning methods [25], have realized this goal for
general face recognition.
Sketch-photo face verification is an interesting yet challenging task that aims to
verify whether a photo of a face and a drawn sketch of a face both portray the same
individual. This task has an important application in assisting law enforcement, i.e.,

using a face sketch to find candidate face photos. However, it is difficult to match
photos and sketches in two different modalities. For example, hand-drawing may
create unpredictable facial distortion and variation compared to a photo, and face
sketches often lack details that can be important cues for preserving identity. Many
attempts have been made to verify faces between sketches and photos. For example, Xiao et al. [26] proposed a local-based strategy built on the embedded hidden Markov model (E-HMM); they transformed the sketches into pseudo-photos and applied the eigenface algorithm for recognition. Zhang et al. [27] added
a refinement step to the existing approaches by applying a support vector regression
(SVR)-based model to synthesize high-frequency information. Similarly, Gao et al.
[28] proposed a new method called SNS-SRE with two steps, i.e., sparse neigh-
bor selection (SNS) to obtain an initial estimation and sparse-representation-based
enhancement (SRE) for further improvement. To capture person identity during the
photo-sketch transformation, [29] defined an optimization objective in the form of
joint generative-discriminative minimization. In particular, a discriminative regular-
ization term is incorporated into the photo-sketch generation, enhancing the discrim-
inability of the generated person sketches in relation to sketches of other individuals
and thus boosting the capacity of both photo-sketch generation and face-sketch ver-
ification.
Matching person faces across still images and videos is a newly emerging task in
intelligent visual surveillance. In these applications, still images (e.g., ID photos) are
usually captured in a controlled environment, while faces in surveillance videos are
filmed in complex scenarios (e.g., with various lighting conditions and occlusions and
in low resolutions). Several cross-domain methods have been proposed to address
the still-to-video face recognition problem [30]. However, their performance remains limited.

References

1. D. Gray, H. Tao, Viewpoint invariant pedestrian recognition with an ensemble of localized features, in Computer Vision–ECCV 2008, pp. 262–275 (2008)
2. M. Farenzena, L. Bazzani, A. Perina, V. Murino, M. Cristani, Person re-
identification by symmetry-driven accumulation of local features, in Computer
Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (IEEE, 2010),
pp. 2360–2367
3. R. Zhao, W. Ouyang, X. Wang, Learning mid-level filters for person re-
identification, in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 144–151 (2014)
4. W. Li, R. Zhao, T. Xiao, X. Wang, Deepreid: deep filter pairing neural network
for person re-identification, in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 152–159 (2014)

5. M. Kostinger, M. Hirzer, P. Wohlhart, P.M. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp. 2288–2295 (2012)
6. B. J. Prosser, W.-S. Zheng, S. Gong, T. Xiang, Q. Mary, Person re-identification
by support vector ranking. in BMVC, vol. 2, no. 5, p. 6 (2010)
7. W.-S. Zheng, S. Gong, T. Xiang, Person re-identification by probabilistic relative
distance comparison, in Computer Vision and Pattern Recognition (CVPR), 2011
IEEE Conference on (IEEE, 2011), pp. 649–656
8. S. Ding, L. Lin, G. Wang, H. Chao, Deep feature learning with relative dis-
tance comparison for person re-identification. Pattern Recogn. 48(10), 2993–
3003 (2015)
9. L. Lin, G. Wang, W. Zuo, X. Feng, L. Zhang, Cross-domain visual matching via
generalized similarity measure and feature learning. IEEE Trans. Pattern Anal.
Mach. Intell. 39(6), 1089–1102 (2017)
10. T. Xiao, H. Li, W. Ouyang, X. Wang, Learning deep feature representations with
domain guided dropout for person re-identification, in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1249–1258 (2016)
11. G. Wang, L. Lin, S. Ding, Y. Li, Q. Wang, Dari: distance metric and representation
integration for person verification. in AAAI, pp. 3611–3617 (2016)
12. J. You, A. Wu, X. Li, W.-S. Zheng, Top-push video-based person re-identification,
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, pp. 1345–1353 (2016)
13. T. Xiao, S. Li, B. Wang, L. Lin, X. Wang, End-to-end deep learning for person
search, arXiv preprint arXiv:1604.01850 (2016)
14. S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, X. Wang, Person search with natural
language description, arXiv preprint arXiv:1702.05729 (2017)
15. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, Y. Bengio, Generative adversarial nets, in NIPS, pp. 2672–2680
(2014)
16. Z. Zheng, L. Zheng, Y. Yang, Unlabeled samples generated by gan improve the
person re-identification baseline in vitro, in ICCV (2017)
17. Z. Zhong, L. Zheng, S. Li, Y. Yang, Generalizing a person retrieval model hetero-
and homogeneously, in ECCV (2018)
18. W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, J. Jiao, Image-image domain
adaptation with preserved self-similarity and domain-dissimilarity for person
reidentification, in CVPR (2018)
19. Y. Li, G. Wang, L. Lin, H. Chang, A deep joint learning approach for age invariant
face verification, in CCF Chinese Conference on Computer Vision (Springer,
2015), pp. 296–305
20. Y. Fu, G. Guo, T.S. Huang, Age synthesis and estimation via faces: a survey, in
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no.
11, pp. 1955–1976 (2010)
21. P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces versus fisherfaces:
recognition using class specific linear projection, Yale University New Haven
United States, Tech. Rep. (1997)

22. B. Moghaddam, T. Jebara, A. Pentland, Bayesian face recognition. Pattern Recogn. 33(11), 1771–1782 (2000)
23. D. Chen, X. Cao, L. Wang, F. Wen, J. Sun, Bayesian face revisited: A joint
formulation, in European Conference on Computer Vision (Springer, 2012), pp.
566–579
24. M. Guillaumin, J. Verbeek, C. Schmid, Is that you? metric learning approaches for
face identification, in ICCV 2009-International Conference on Computer Vision
(IEEE, 2009), pp. 498–505
25. Y. Li, G. Wang, L. Nie, Q. Wang, W. Tan, Distance metric optimization driven
convolutional neural network for age invariant face recognition. Pattern Recogn.
75, 51–62 (2018)
26. B. Xiao, X. Gao, D. Tao, X. Li, A new approach for face recognition by sketches
in photos. Signal Process. 89(8), 1576–1588 (2009)
27. J. Zhang, N. Wang, X. Gao, D. Tao, X. Li, Face sketch-photo synthesis based on
support vector regression, in Image Processing (ICIP), 2011 18th IEEE Interna-
tional Conference on (IEEE, 2011), pp. 1125–1128
28. X. Gao, N. Wang, D. Tao, X. Li et al., Face sketch-photo synthesis and retrieval
using sparse representation, IEEE Transactions on Circuits and Systems for Video
Technology, vol. 22, no. 8, pp. 1213–1226 (2012)
29. L. Zhang, L. Lin, X. Wu, S. Ding, L. Zhang, End-to-end photo-sketch generation
via fully convolutional representation learning, in Proceedings of the 5th ACM on
International Conference on Multimedia Retrieval (ACM, 2015), pp. 627–634
30. Z. Huang, S. Shan, H. Zhang, S. Lao, A. Kuerban, X. Chen, Benchmarking
still-to-video face recognition via partial and local linear discriminant analysis
on cox-s2v dataset, in Asian Conference on Computer Vision. Springer, pp. 589–
600 (2012)
Chapter 8
Person Verification

Abstract Cross-domain visual data matching is one of the fundamental problems in many real-world vision tasks, e.g., matching persons across ID photos and surveil-
lance videos. Conventional approaches to this problem usually involve two steps: (i)
projecting samples from different domains into a common space and (ii) computing
(dis)similarity in this space based on a certain distance. In this chapter, we present a
novel pairwise similarity measure that advances the existing models by (i) expand-
ing traditional linear projections into affine transformations and (ii) fusing affine
Mahalanobis distance and cosine similarity in a data-driven combination. More-
over, we unify our similarity measure with feature representation learning via deep
convolutional neural networks. Specifically, we incorporate the similarity measure
matrix into the deep architecture, enabling an end-to-end method of model optimiza-
tion. We extensively evaluate our generalized similarity model in several challenging
cross-domain matching tasks: person reidentification in different views and face ver-
ification in different modalities (i.e., faces from still images and videos, older and
younger faces, and sketch and photo portraits). The experimental results demonstrate
the superior performance of our model compared to other state-of-the-art methods
(© [2019] IEEE. Reprinted, with permission, from [1]).

8.1 Introduction

In this chapter, we formulate person verification as a cross-domain visual matching problem. A common strategy in the literature for coping with the cross-domain matching of visual data is to learn a common space for the different domains. CCA [2] learns
the common space via maximizing cross-view correlation, while PLS [3] learns
via maximizing cross-view covariance. Coupled information-theoretic encoding is
proposed to maximize the mutual information [4]. Another conventional strategy is to
synthesize samples from the input domain into the other domain. Rather than learning
the mapping between two domains in the data space, dictionary learning [5, 6] can be
used to alleviate cross-domain heterogeneity, and semicoupled dictionary learning
(SCDL) [6] is proposed to model the relationship on the sparse coding vectors from
the two domains. Duan et al. propose another framework called a domain adaptation machine (DAM) [7] for multiple source domain adaptation, but this approach requires
a set of pretrained base classifiers.
Various discriminative common space approaches have been developed by uti-
lizing label information. Supervised information can be employed by the Rayleigh
quotient [2], which treats the label as the common space [8], or by employing the
max-margin rule [9]. Using the SCDL framework, structured group sparsity has
been adopted to utilize label information [5]. The generalization of discriminative
common space to multiple views has also been studied [10]. Kan et al. propose a
multiview discriminant analysis (MvDA) [11] method to obtain a common space for
multiple views by optimizing both the interview and intraview Rayleigh quotients.
In [12], a method is proposed to learn shape models using local curve segments with multiple types of distance metrics.
For most existing multiview analysis methods, the target is defined based on
the standard inner product or distance between the samples in the feature space. In
the field of metric learning, several generalized similarity/distance measures have
been studied to improve recognition performance. In [13, 14], the generalized dis-
tance/similarity measures are formulated as the difference between the distance com-
ponent and the similarity component to take into account both the cross-inner-product
term and two norm terms. Li et al. [15] adopt the second-order decision function as
a distance measure without considering the positive semidefinite (PSD) constraint.
Chang and Yeung [16] suggest an approach to learning locally smooth metrics using
local affine transformations while preserving the topological structure of the origi-
nal data. These distance/similarity measures, however, were developed for matching
samples from the same domain, and they cannot be directly applied to cross-domain
data matching.
To extend traditional single-domain metric learning, Mignon and Jurie [17] sug-
gest a cross-modal metric learning (CMML) model, which learns domain-specific
transformations based on a generalized logistic loss. Zhai et al. [18] incorporate joint
graph regularization into a heterogeneous metric learning model to improve the cross-
media retrieval accuracy. In [17, 18], Euclidean distance is adopted to measure the
dissimilarity in the latent space. Instead of explicitly learning domain-specific trans-
formations, Kang et al. [19] learn a low-rank matrix to parameterize the cross-modal
similarity measure by the accelerated proximal gradient (APG) algorithm. However,
these methods are based mainly on common similarity or distance measures, and
none of them addresses the feature learning problem in cross-domain scenarios.
Instead of using handcrafted features, learning feature representations and contex-
tual relations with deep neural networks, especially the convolutional neural network
(CNN) [20], have shown great potential in various pattern recognition tasks such as
object recognition [21] and semantic segmentation [22]. Significant performance
gains have also been achieved in face recognition [23] and person reidentification
[24–27] that are mainly attributable to the progress in deep learning. Recently, several
deep CNN-based models have been explored for similarity matching and learning.
For example, Andrew et al. [28] propose a multilayer CCA model consisting of
several stacked nonlinear transformations. Li et al. [29] learn filter pairs via deep
networks to handle misalignment and photometric and geometric transformations

and achieve promising results for the person reidentification task. Wang et al. [30]
learn fine-grained image similarity with a deep ranking model. Yi et al. [31] present
a deep metric learning approach by generalizing the Siamese CNN. Ahmed et al.
[25] propose a deep convolutional architecture to measure the similarity between a
pair of pedestrian images. In addition to the shared convolutional layers, their net-
work includes a neighborhood difference layer and a patch summary layer to compute
cross-input neighborhood differences. Chen et al. [26] propose a deep ranking frame-
work to learn the joint representation of an image pair and return the similarity score
directly in which the similarity model is replaced by full connection layers.
Our deep model is partially motivated by the above works, but we target a more
powerful solution to cross-domain visual matching by incorporating a generalized
similarity function into deep neural networks. Moreover, our network architecture is
different from those presented in the existing works, leading to new state-of-the-art
results for several challenging person verification and recognition tasks.
In the remainder of this chapter, we detail this approach: a generalized pairwise similarity measure that expands traditional linear projections into affine transformations and fuses the affine Mahalanobis distance with cosine similarity in a data-driven combination, unified with CNN-based feature representation learning so that the whole model can be optimized end-to-end. We extensively evaluate the resulting model on several challenging cross-domain matching tasks, including person reidentification across camera views and face verification across modalities.

8.2 Generalized Similarity Measures

Visual similarity matching is arguably one of the most fundamental problems in computer vision and pattern recognition, and this problem becomes more challenging
when dealing with cross-domain data. For example, in still-video face retrieval,
a newly emerging task in visual surveillance, faces from still images captured in a
constrained environment are utilized as queries to find matches of the same identity in
unconstrained videos. Age-invariant and sketch-photo face verification tasks are also
examples of cross-domain image matching. Conventional approaches (e.g., canonical
correlation analysis [2] and partial least squares regression [3]) for cross-domain
matching usually follow a two-step procedure:

(1) Samples from different modalities are first projected into a common space
by learning a transformation. The computation may be simplified by assuming that
these cross-domain samples share the same projection.
(2) A certain distance is then utilized to measure the similarity in the projection
space. Usually, Euclidean distance or inner product distance is used.
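For reference, a minimal NumPy sketch of this conventional two-step baseline (not the model proposed in this chapter) might look as follows; the random projection matrices stand in for ones that would be learned by, e.g., CCA or PLS.

```python
import numpy as np

def two_step_similarity(x, y, U, V):
    """Conventional baseline: project each domain, then compare with a distance."""
    px, py = U @ x, V @ y            # step (1): map both samples into a common space
    return -np.linalg.norm(px - py)  # step (2): negative Euclidean distance as similarity

# Toy usage with random projections standing in for learned CCA/PLS solutions.
rng = np.random.default_rng(0)
x, y = rng.normal(size=128), rng.normal(size=64)     # samples from two modalities
U, V = rng.normal(size=(32, 128)), rng.normal(size=(32, 64))
print(two_step_similarity(x, y, U, V))
```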
Suppose that x and y are two samples of different modalities, and U and V are
two projection matrices applied to x and y, respectively. Ux and Vy are usually
formulated as linear similarity transformations mainly for convenient optimization.
A similarity transformation has a useful property of preserving the shape of an
object that undergoes this transformation, but it is limited in capturing complex
deformations that usually exist in various real problems, e.g., translation, shearing,
and composition. On the other hand, Mahalanobis distance, cosine similarity, and
combinations of the two have been widely studied in the research on similarity metric
learning, but how to unify feature learning and similarity learning, in particular, how
to combine Mahalanobis distance with cosine similarity and integrate the distance
metric into deep neural networks for end-to-end learning, remains less investigated.
To address the above issues, in this work, we present a more general similarity
measure and unify it with deep convolutional representation learning. One of the key
innovations is that we generalize two aspects of the existing similarity models. First,
we extend the similarity transformations Ux and Vy to the affine transformations
by adding a translation vector to them, i.e., replacing Ux and Vy with LA x + a
and LB y + b, respectively. Affine transformation is a generalization of similarity
transformation without the requirement of preserving the original point in a linear
space, and it is able to capture more complex deformations. Second, in contrast to the
traditional approaches that choose either Mahalanobis distance or cosine similarity,
we combine these two measures in the affine transformation. This combination is
realized in a data-driven fashion, as discussed in the Appendix, resulting in a novel
generalized similarity measure, defined as
$S(\mathbf{x}, \mathbf{y}) = [\mathbf{x}^T \ \mathbf{y}^T \ 1] \begin{bmatrix} \mathbf{A} & \mathbf{C} & \mathbf{d} \\ \mathbf{C}^T & \mathbf{B} & \mathbf{e} \\ \mathbf{d}^T & \mathbf{e}^T & f \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \\ 1 \end{bmatrix},$   (8.1)

where the submatrices $\mathbf{A}$ and $\mathbf{B}$ are positive semidefinite, representing the self-correlations of the samples in their own domains, and $\mathbf{C}$ is a correlation matrix crossing the two domains.
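A minimal NumPy sketch of evaluating Eq. (8.1) for a single pair, assuming the blocks $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, $\mathbf{d}$, $\mathbf{e}$, $f$ have already been learned; the dimensions and random values below are purely illustrative.

```python
import numpy as np

def generalized_similarity(x, y, A, B, C, d, e, f):
    """Evaluate S(x, y) of Eq. (8.1) for one cross-domain pair."""
    xy1 = np.concatenate([x, y, [1.0]])
    M = np.block([[A, C, d[:, None]],
                  [C.T, B, e[:, None]],
                  [d[None, :], e[None, :], np.array([[f]])]])
    return xy1 @ M @ xy1

# Toy usage with 3-dimensional samples from each domain.
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
L = rng.normal(size=(3, 3))
A = L.T @ L                      # positive semidefinite self-correlation block
B = np.eye(3)
C = rng.normal(size=(3, 3))      # cross-domain correlation block
d, e, f = rng.normal(size=3), rng.normal(size=3), -1.0
print(generalized_similarity(x, y, A, B, C, d, e, f))
```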
Figure 8.1 intuitively explains this idea. In this example, it is observed that Euclidean distance in the linear transformation, as (a) illustrates, can be regarded as a special case of our model with $\mathbf{A} = \mathbf{U}^T\mathbf{U}$, $\mathbf{B} = \mathbf{V}^T\mathbf{V}$, $\mathbf{C} = -\mathbf{U}^T\mathbf{V}$, $\mathbf{d} = \mathbf{0}$, $\mathbf{e} = \mathbf{0}$, and $f = 0$. Our similarity model can thus be viewed as a generalization of several recent metric learning models [13, 15]. The experimental results show that the introduction of $(\mathbf{d}, \mathbf{e}, f)$ and a more flexible setting of $(\mathbf{A}, \mathbf{B}, \mathbf{C})$ significantly improve the matching performance.
Another innovation of this work is that we unify feature representation learning and similarity measure learning.

Fig. 8.1 Illustration of the generalized similarity model. Conventional approaches project data by
simply using linear similarity transformations (i.e., U and V), as illustrated in (a), where Euclidean
distance is applied as the distance metric. As illustrated in (b), we improve the existing models by
(i) expanding the traditional linear similarity transformation into an affine transformation and (ii)
fusing Mahalanobis distance and cosine similarity. The case in (a) is a simplified version of our
model. Please refer to the Appendix for the derivation details

In the literature, most of the existing models are performed in the original data space or in a predefined feature space; that is, the feature
extraction and the similarity measure are studied separately. These methods may have
several drawbacks in practice. For example, the similarity models rely heavily on fea-
ture engineering and thus lack generalizability when applied to problems in different
scenarios. Moreover, the interaction between the feature representations and similar-
ity measures is ignored or simplified, thus limiting their performance. Meanwhile,
deep learning, especially the convolutional neural network (CNN), has demonstrated
its effectiveness in learning discriminative features from raw data and has benefited
from building end-to-end learning frameworks. Motivated by these works, we build
a deep architecture to integrate our similarity measure into CNN-based feature repre-
sentation learning. Our architecture takes raw images from different modalities as the
inputs and automatically produces representations of these images by sequentially
stacking shared subnetworks upon domain-specific subnetworks. Upon these layers,
we further incorporate the components of our similarity measure by stimulating them
with several appended structured neural network layers. The feature learning and the
similarity model learning are thus integrated for end-to-end optimization.

8.2.1 Model Formulation

According to the discussion in Sect. 8.2, our generalized similarity measure extends
the traditional linear projection and integrates Mahalanobis distance and cosine sim-
ilarity into a generic form, as shown in Eq. (8.1). As shown in the Appendix, A
and B in our similarity measure are positive semidefinite, but C does not obey this
constraint. Hence, we can further factorize A, B and C as follows:

A = LA T LA ,
B = LB T LB , (8.2)
xT y
C= −LC LC .

Moreover, our model extracts feature representation (i.e., f1 (x) and f2 (y)) from
the raw input data by utilizing the CNN. Incorporating the feature representation and
the above matrix factorization into Eq. (8.1), we thus obtain the following similarity
model:

$\tilde{S}(\mathbf{x}, \mathbf{y}) = S(\mathbf{f}_1(\mathbf{x}), \mathbf{f}_2(\mathbf{y})) = [\mathbf{f}_1(\mathbf{x})^T \ \mathbf{f}_2(\mathbf{y})^T \ 1] \begin{bmatrix} \mathbf{A} & \mathbf{C} & \mathbf{d} \\ \mathbf{C}^T & \mathbf{B} & \mathbf{e} \\ \mathbf{d}^T & \mathbf{e}^T & f \end{bmatrix} \begin{bmatrix} \mathbf{f}_1(\mathbf{x}) \\ \mathbf{f}_2(\mathbf{y}) \\ 1 \end{bmatrix}$
$= \|\mathbf{L}_A\mathbf{f}_1(\mathbf{x})\|^2 + \|\mathbf{L}_B\mathbf{f}_2(\mathbf{y})\|^2 - 2(\mathbf{L}_C^x\mathbf{f}_1(\mathbf{x}))^T(\mathbf{L}_C^y\mathbf{f}_2(\mathbf{y})) + 2\mathbf{d}^T\mathbf{f}_1(\mathbf{x}) + 2\mathbf{e}^T\mathbf{f}_2(\mathbf{y}) + f.$   (8.3)

Specifically, $\mathbf{L}_A\mathbf{f}_1(\mathbf{x})$, $\mathbf{L}_C^x\mathbf{f}_1(\mathbf{x})$, and $\mathbf{d}^T\mathbf{f}_1(\mathbf{x})$ can be regarded as the similarity components for $\mathbf{x}$, while $\mathbf{L}_B\mathbf{f}_2(\mathbf{y})$, $\mathbf{L}_C^y\mathbf{f}_2(\mathbf{y})$, and $\mathbf{e}^T\mathbf{f}_2(\mathbf{y})$ correspondingly represent $\mathbf{y}$. These similarity components are modeled as the weights that connect the neurons of the last two layers. For example, a portion of the output activations represents $\mathbf{L}_A\mathbf{f}_1(\mathbf{x})$ by taking $\mathbf{f}_1(\mathbf{x})$ as the input and multiplying it by the corresponding weights $\mathbf{L}_A$.
Below, we discuss the formulation of our similarity learning.
The objective of our similarity learning is to seek a function S̃(x, y) that satisfies
a set of similarity/dissimilarity constraints. Instead of learning a similarity func-
tion in a handcrafted feature space, we take the raw data as input and introduce a
deep similarity learning framework to integrate nonlinear feature learning and gen-
eralized similarity learning. Recall that our deep generalized similarity model is shown in Eq. (8.3): $\mathbf{f}_1(\mathbf{x})$ and $\mathbf{f}_2(\mathbf{y})$ are the feature representations for samples from different modalities, and we use $\mathbf{W}$ to indicate their parameters. We denote by $\Phi = (\mathbf{L}_A, \mathbf{L}_B, \mathbf{L}_C^x, \mathbf{L}_C^y, \mathbf{d}, \mathbf{e}, f)$ the similarity components for sample matching. Note that $\tilde{S}(\mathbf{x}, \mathbf{y})$ is asymmetric, i.e., $\tilde{S}(\mathbf{x}, \mathbf{y}) \ne \tilde{S}(\mathbf{y}, \mathbf{x})$. This is reasonable for cross-domain matching because the similarity components are domain-specific.
Assume that $D = \{(\{\mathbf{x}_i, \mathbf{y}_i\}, \ell_i)\}_{i=1}^{N}$ is a training set of cross-domain sample pairs, where $\{\mathbf{x}_i, \mathbf{y}_i\}$ denotes the $i$th pair, and $\ell_i$ denotes the corresponding label of $\{\mathbf{x}_i, \mathbf{y}_i\}$ indicating whether $\mathbf{x}_i$ and $\mathbf{y}_i$ are from the same class:
$\ell_i = \ell(\mathbf{x}_i, \mathbf{y}_i) = \begin{cases} -1, & c(\mathbf{x}_i) = c(\mathbf{y}_i) \\ \ \ 1, & \text{otherwise,} \end{cases}$   (8.4)

where $c(\mathbf{x})$ denotes the class label of the sample $\mathbf{x}$. An ideal deep similarity model is expected to satisfy the following constraints:

$\tilde{S}(\mathbf{x}_i, \mathbf{y}_i) \begin{cases} < -1, & \text{if } \ell_i = -1 \\ \ge 1, & \text{otherwise} \end{cases}$   (8.5)

for any $\{\mathbf{x}_i, \mathbf{y}_i\}$.


Note that a feasible solution that satisfies the above constraints may not exist.
To avoid this scenario, we relax the hard constraints in Eq. (8.5) by introducing a
hinge-like loss:

$G(\mathbf{W}, \Phi) = \sum_{i=1}^{N} (1 - \ell_i\tilde{S}(\mathbf{x}_i, \mathbf{y}_i))_+.$   (8.6)

To improve the stability of the solution, some regularizers are also introduced, result-
ing in our deep similarity learning model:

$(\hat{\mathbf{W}}, \hat{\Phi}) = \arg\min_{\mathbf{W}, \Phi} \sum_{i=1}^{N} (1 - \ell_i\tilde{S}(\mathbf{x}_i, \mathbf{y}_i))_+ + \Psi(\mathbf{W}, \Phi),$   (8.7)

where $\Psi(\mathbf{W}, \Phi) = \lambda\|\mathbf{W}\|^2 + \mu\|\Phi\|^2$ denotes the regularizer on the parameters of the feature representation and the generalized similarity model.
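The hinge-like loss of Eq. (8.6) can be sketched as follows in PyTorch; recall from Eq. (8.4) that $\ell = -1$ marks a same-class pair, and the toy scores below are illustrative. The regularizer of Eq. (8.7) would simply add weight-decay terms on the network and similarity parameters.

```python
import torch

def generalized_hinge_loss(s_tilde, labels):
    """Eq. (8.6): sum_i (1 - l_i * S~(x_i, y_i))_+ , with l_i = -1 for same-class pairs."""
    return torch.clamp(1.0 - labels * s_tilde, min=0.0).sum()

# Toy usage: similarity scores for three pairs and their labels (-1 = same identity).
s_tilde = torch.tensor([-1.8, 0.4, 1.3])
labels = torch.tensor([-1.0, -1.0, 1.0])
loss = generalized_hinge_loss(s_tilde, labels)
# Eq. (8.7) would add lambda*||W||^2 + mu*||Phi||^2 over the model parameters (omitted).
print(loss)   # only the second pair violates its margin: loss = 1 + 0.4 = 1.4
```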

8.2.2 Connection with Existing Models

Our generalized similarity learning model is a generalization of many existing metric learning models, which can be treated as special cases of our model by imposing
extra constraints on (A, B, C, d, e, f ).
A conventional similarity model is usually defined as $S_M(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{M}\mathbf{y}$, and this form is equivalent to our model when $\mathbf{A} = \mathbf{B} = \mathbf{0}$, $\mathbf{C} = \frac{1}{2}\mathbf{M}$, $\mathbf{d} = \mathbf{e} = \mathbf{0}$, and $f = 0$. Similarly, the Mahalanobis distance $D_M(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})^T\mathbf{M}(\mathbf{x} - \mathbf{y})$ is also a special case of our model with $\mathbf{A} = \mathbf{B} = \mathbf{M}$, $\mathbf{C} = -\mathbf{M}$, $\mathbf{d} = \mathbf{e} = \mathbf{0}$, and $f = 0$.
Below, we connect our similarity model to two state-of-the-art similarity learning
methods, i.e., LADF [15] and joint Bayesian [13].
In [15], Li et al. propose learning a decision function that jointly models a distance
metric and a locally adaptive thresholding rule, and the so-called locally adaptive
decision function (LADF) is formulated as a second-order large-margin regulariza-
tion problem. Specifically, LADF is defined as

$F(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{y}^T\mathbf{A}\mathbf{y} + 2\mathbf{x}^T\mathbf{C}\mathbf{y} + \mathbf{d}^T(\mathbf{x} + \mathbf{y}) + f.$   (8.8)



One can observe that F(x, y) = S(x, y) when we set B = A and e = d in our model.
It should be noted that LADF treats x and y using the same metrics, i.e., A for both
xT Ax and yT Ay, and d for dT x and dT y. Such a model is reasonable for matching
samples with the same modality but may be unsuitable for cross-domain matching
where x and y are from different modalities. Compared with LADF, our model uses
A and d to calculate xT Ax and dT x and uses B and e to calculate yT By and eT y,
making our model more effective for cross-domain matching.
In [13], Chen et al. extend the classical Bayesian face model by learning a joint dis-
tribution (i.e., intraperson and extraperson variations) of sample pairs. Their decision
function is expressed in the following form:

$J(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{y}^T\mathbf{A}\mathbf{y} - 2\mathbf{x}^T\mathbf{G}\mathbf{y}.$   (8.9)

Note that the similarity metric model proposed in [14] adopts a similar form. Inter-
estingly, this decision function is also a special variant of our model if we set B = A,
C = −G, d = 0, e = 0, and f = 0.
In summary, our similarity model can be regarded as a generalization of many
existing cross-domain matching and metric learning models; therefore, it is more
flexible and suitable than those models for cross-domain visual data matching.
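As a quick numerical sanity check of the Mahalanobis reduction above, the following NumPy snippet instantiates Eq. (8.1) with $\mathbf{A} = \mathbf{B} = \mathbf{M}$, $\mathbf{C} = -\mathbf{M}$, $\mathbf{d} = \mathbf{e} = \mathbf{0}$, and $f = 0$ and verifies that it coincides with $D_M(\mathbf{x}, \mathbf{y})$; all quantities are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4
L = rng.normal(size=(dim, dim))
M = L.T @ L                                   # a positive semidefinite metric
x, y = rng.normal(size=dim), rng.normal(size=dim)

# Eq. (8.1) with A = B = M, C = -M, d = e = 0, f = 0.
xy1 = np.concatenate([x, y, [1.0]])
zeros = np.zeros((dim, 1))
S = np.block([[M, -M, zeros],
              [-M.T, M, zeros],
              [zeros.T, zeros.T, np.zeros((1, 1))]])
s_general = xy1 @ S @ xy1

mahalanobis = (x - y) @ M @ (x - y)           # D_M(x, y)
assert np.allclose(s_general, mahalanobis)    # the two quantities agree
```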

8.3 Joint Similarity and Feature Learning

In this section, we introduce our deep architecture that integrates the generalized
similarity measure with convolutional feature representation learning.

8.3.1 Deep Architecture

As discussed above, our model defined in Eq. (8.7) jointly addresses similarity func-
tion learning and feature learning. This integration is achieved by building a deep
architecture of convolutional neural networks, which is illustrated in Fig. 8.2. It is
worth mentioning that our architecture is able to handle input samples from different
modalities with unequal numbers, e.g., 20 samples of x and 200 samples of y are fed
into the network as a batch process.
From left to right in Fig. 8.2, two domain-specific subnetworks, g1 (x) and g2 (y),
are applied to samples from two different modalities. Then, the outputs of g1 (x) and
g2 (y) are concatenated into a shared subnetwork f(·). We superpose g1 (x) and g2 (y)
to feed f(·). At the output of f(·), the feature representations of the two samples
are extracted separately as f1 (x) and f2 (y), as indicated by the slice operator in
Fig. 8.2. Finally, these learned feature representations are utilized in the structured
fully connected layers that incorporate the similarity components defined in Eq. (8.3).
Below, we introduce the detailed setting of the three subnetworks.

Fig. 8.2 Deep architecture of our similarity model. This architecture comprises three parts: a
domain-specific subnetwork, a shared subnetwork and a similarity subnetwork. The first two parts
extract feature representations from samples from different domains, which are built upon a number
of convolutional layers, max-pooling operations, and fully connected layers. The similarity subnet-
work contains two structured fully connected layers that incorporate the similarity components in
Eq. (8.3)

Domain-specific Subnetwork. We separate two neural network branches to address the samples from different domains. Each network branch includes one con-
volutional layer with 3 filters of size 5 × 5 and a stride step of 2 pixels. The rectified
nonlinear activation is utilized. Then, we apply one max-pooling operation with a
size of 3 × 3 and a stride step of 3 pixels.
Shared Subnetwork. For this component, we stack one convolutional layer and
two fully connected layers. The convolutional layer contains 32 filters of size 5 × 5,
and the filter stride step is set as 1 pixel. The kernel size of the max-pooling operation
is 3 × 3, and its stride step is 3 pixels. The output vectors of the two fully connected
layers are of 400 dimensions. We further normalize the output of the second fully
connected layer before it is fed into the next subnetwork.
Similarity Subnetwork. In this subnetwork, a slice operator, which partitions the
vectors into two groups corresponding to the two domains, is first applied. For the
example in Fig. 8.2, 220 vectors are grouped into two sets, i.e., f1 (x) and f2 (y), with
sizes of 20 and 200, respectively. f1 (x) and f2 (y) are both 400 dimensions. Then, f1 (x)
and f2 (y) are fed into two branches of the neural network, and each branch contains
a fully connected layer. We divide the activations of these two layers into six parts
according to the six similarity components. As shown in Fig. 8.2, in the top branch,
the neural layer connects to $\mathbf{f}_1(\mathbf{x})$ and outputs $\mathbf{L}_A\mathbf{f}_1(\mathbf{x})$, $\mathbf{L}_C^x\mathbf{f}_1(\mathbf{x})$, and $\mathbf{d}^T\mathbf{f}_1(\mathbf{x})$. In the bottom branch, the layer outputs $\mathbf{L}_B\mathbf{f}_2(\mathbf{y})$, $\mathbf{L}_C^y\mathbf{f}_2(\mathbf{y})$, and $\mathbf{e}^T\mathbf{f}_2(\mathbf{y})$ by connecting
to f2 (y). In this way, the similarity measure is tightly integrated with the feature
representations, and they can be jointly optimized during the model training. Note
that f is a parameter of the generalized similarity measure in Eq. (8.1). Experiments
show that the value of f affects only the learning convergence and not the matching
performance. Thus, we empirically set f = −1.9 in our experiments.

In the deep architecture, owing to the factorization, the similarity components of x and those of y do not interact with each other until the final aggregation calculation; that
is, computing the components of x is independent of y. This leads to a good property
of efficient matching. In particular, we can precompute the feature representation
and the corresponding similarity components of each sample stored in a database,
and the similarity matching in the testing stage will then be very fast.
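The following PyTorch sketch illustrates the three-part structure described in this section under simplifying assumptions: pairs are taken to be aligned within a batch (rather than the unequal-batch setting mentioned above), and the layer sizes only loosely follow the description. It is a structural illustration, not the published configuration.

```python
import torch
import torch.nn as nn

class GeneralizedSimilarityNet(nn.Module):
    """Sketch of the domain-specific / shared / similarity subnetworks."""
    def __init__(self, feat_dim=400, r=400):
        super().__init__()
        branch = lambda: nn.Sequential(                     # domain-specific branch
            nn.Conv2d(3, 3, kernel_size=5, stride=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=3))
        self.branch_x, self.branch_y = branch(), branch()
        self.shared = nn.Sequential(                        # shared subnetwork
            nn.Conv2d(3, 32, kernel_size=5, stride=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=3),
            nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))
        # Similarity subnetwork: one structured layer per domain, producing the
        # stacked similarity components [L_A f1(x); L_C^x f1(x); d^T f1(x)] and
        # [L_B f2(y); L_C^y f2(y); e^T f2(y)].
        self.sim_x = nn.Linear(feat_dim, 2 * r + 1, bias=False)
        self.sim_y = nn.Linear(feat_dim, 2 * r + 1, bias=False)
        self.f = -1.9
        self.r = r

    def forward(self, x_imgs, y_imgs):
        fx = nn.functional.normalize(self.shared(self.branch_x(x_imgs)))
        fy = nn.functional.normalize(self.shared(self.branch_y(y_imgs)))
        x_tilde, y_tilde = self.sim_x(fx), self.sim_y(fy)
        r = self.r
        ax, cx, dx = x_tilde[:, :r], x_tilde[:, r:2*r], x_tilde[:, 2*r:]
        by, cy, ey = y_tilde[:, :r], y_tilde[:, r:2*r], y_tilde[:, 2*r:]
        # Eq. (8.3): pairwise similarity for aligned (x, y) pairs in the batch.
        return ((ax ** 2).sum(1) + (by ** 2).sum(1) - 2 * (cx * cy).sum(1)
                + 2 * dx.squeeze(1) + 2 * ey.squeeze(1) + self.f)
```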

8.3.2 Model Training

In this section, we discuss the learning method for our similarity model training. To
avoid loading all images into memory, we use the minibatch learning approach; that
is, in each training iteration, a subset of the image pairs is fed into the neural network
for model optimization.
For notation simplicity in discussing the learning algorithm, we start by introduc-
ing the following definitions:


$\tilde{\mathbf{x}} = [\, \mathbf{L}_A\mathbf{f}_1(\mathbf{x}) \ \ \mathbf{L}_C^x\mathbf{f}_1(\mathbf{x}) \ \ \mathbf{d}^T\mathbf{f}_1(\mathbf{x}) \,]^T, \qquad \tilde{\mathbf{y}} = [\, \mathbf{L}_B\mathbf{f}_2(\mathbf{y}) \ \ \mathbf{L}_C^y\mathbf{f}_2(\mathbf{y}) \ \ \mathbf{e}^T\mathbf{f}_2(\mathbf{y}) \,]^T,$   (8.10)

where x̃ and ỹ denote the output layer activation of samples x and y. Prior to incor-
porating Eq. (8.10) into the similarity model in Eq. (8.3), we introduce three trans-
formation matrices (using Matlab representations):

$\mathbf{P}_1 = [\, \mathbf{I}_{r \times r} \ \ \mathbf{0}_{r \times (r+1)} \,], \quad \mathbf{P}_2 = [\, \mathbf{0}_{r \times r} \ \ \mathbf{I}_{r \times r} \ \ \mathbf{0}_{r \times 1} \,], \quad \mathbf{p}_3 = [\, \mathbf{0}_{1 \times 2r} \ \ 1 \,]^T,$   (8.11)

where $r$ equals the dimension of the output of the shared neural network (i.e., the dimension of $\mathbf{f}_1(\mathbf{x})$ and $\mathbf{f}_2(\mathbf{y})$), and $\mathbf{I}$ indicates the identity matrix. Then, our similarity model can be rewritten as

$\tilde{S}(\mathbf{x}, \mathbf{y}) = (\mathbf{P}_1\tilde{\mathbf{x}})^T\mathbf{P}_1\tilde{\mathbf{x}} + (\mathbf{P}_1\tilde{\mathbf{y}})^T\mathbf{P}_1\tilde{\mathbf{y}} - 2(\mathbf{P}_2\tilde{\mathbf{x}})^T\mathbf{P}_2\tilde{\mathbf{y}} + 2\mathbf{p}_3^T\tilde{\mathbf{x}} + 2\mathbf{p}_3^T\tilde{\mathbf{y}} + f.$   (8.12)

By incorporating Eq. (8.12) into the loss function Eq. (8.6), we obtain the follow-
ing objective:
$G(\mathbf{W}, \Phi; D) = \sum_{i=1}^{N} \big\{ 1 - \ell_i \big[ (\mathbf{P}_1\tilde{\mathbf{x}}_i)^T\mathbf{P}_1\tilde{\mathbf{x}}_i + (\mathbf{P}_1\tilde{\mathbf{y}}_i)^T\mathbf{P}_1\tilde{\mathbf{y}}_i - 2(\mathbf{P}_2\tilde{\mathbf{x}}_i)^T\mathbf{P}_2\tilde{\mathbf{y}}_i + 2\mathbf{p}_3^T\tilde{\mathbf{x}}_i + 2\mathbf{p}_3^T\tilde{\mathbf{y}}_i + f \big] \big\}_+,$   (8.13)

where the summation term denotes the hinge-like loss for the cross-domain sample pair $\{\tilde{\mathbf{x}}_i, \tilde{\mathbf{y}}_i\}$, $N$ is the total number of pairs, $\mathbf{W}$ represents the feature representation parameters of the different domains, and $\Phi$ represents the similarity model parameters. $\mathbf{W}$ and $\Phi$ are both embedded as weights connecting neurons of layers in our deep neural network model, as Fig. 8.2 illustrates.
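To see how the selector matrices of Eq. (8.11) recover Eq. (8.12) from the stacked activations $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{y}}$, the following NumPy check evaluates both Eq. (8.12) and the block form of Eq. (8.3) on random activations (the dimension $r = 4$ is illustrative) and confirms that they agree.

```python
import numpy as np

r = 4
rng = np.random.default_rng(0)
x_tilde, y_tilde = rng.normal(size=2 * r + 1), rng.normal(size=2 * r + 1)
f = -1.9

# Selector matrices of Eq. (8.11).
P1 = np.hstack([np.eye(r), np.zeros((r, r + 1))])
P2 = np.hstack([np.zeros((r, r)), np.eye(r), np.zeros((r, 1))])
p3 = np.concatenate([np.zeros(2 * r), [1.0]])

# Eq. (8.12).
s = ((P1 @ x_tilde) @ (P1 @ x_tilde) + (P1 @ y_tilde) @ (P1 @ y_tilde)
     - 2 * (P2 @ x_tilde) @ (P2 @ y_tilde)
     + 2 * p3 @ x_tilde + 2 * p3 @ y_tilde + f)

# Direct evaluation from the blocks of x_tilde and y_tilde, as in Eq. (8.3).
direct = ((x_tilde[:r] ** 2).sum() + (y_tilde[:r] ** 2).sum()
          - 2 * x_tilde[r:2*r] @ y_tilde[r:2*r]
          + 2 * x_tilde[-1] + 2 * y_tilde[-1] + f)
assert np.isclose(s, direct)
```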
The objective function in Eq. (8.13) is defined in a sample-pair-based form. To optimize it using SGD, a certain scheme should be applied to generate minibatches of the sample pairs, which is usually associated with high computation and memory costs. Note that the sample pairs in the training set $D$ are constructed from the original set of samples from the different modalities, $Z = \{\{X\}, \{Y\}\}$, where $X = \{\mathbf{x}^1, ..., \mathbf{x}^j, ..., \mathbf{x}^{M_x}\}$ and $Y = \{\mathbf{y}^1, ..., \mathbf{y}^j, ..., \mathbf{y}^{M_y}\}$. The superscript denotes the sample index in the original training set, e.g., $\mathbf{x}^j \in X$ and $\mathbf{y}^j \in Y$, while the subscript denotes the index of the sample pairs, e.g., $\mathbf{x}_i \in \{\mathbf{x}_i, \mathbf{y}_i\} \in D$. $M_x$ and $M_y$ denote the total numbers of samples from the two domains. Without loss of generality, we define $\mathbf{z}^j = \mathbf{x}^j$ and $\mathbf{z}^{M_x + j} = \mathbf{y}^j$. For each pair $\{\mathbf{x}_i, \mathbf{y}_i\}$ in $D$, we have $\mathbf{z}^{j_{i,1}} = \mathbf{x}_i$ and $\mathbf{z}^{j_{i,2}} = \mathbf{y}_i$ with $1 \le j_{i,1} \le M_x$ and $M_x + 1 \le j_{i,2} \le M_z\,(= M_x + M_y)$. Correspondingly, we have $\tilde{\mathbf{z}}^{j_{i,1}} = \tilde{\mathbf{x}}_i$ and $\tilde{\mathbf{z}}^{j_{i,2}} = \tilde{\mathbf{y}}_i$.
Therefore, we rewrite Eq. (8.13) in a sample-based form:

$L(\mathbf{W}, \Phi; Z) = \sum_{i=1}^{N} \big\{ 1 - \ell_i \big[ (\mathbf{P}_1\tilde{\mathbf{z}}^{j_{i,1}})^T\mathbf{P}_1\tilde{\mathbf{z}}^{j_{i,1}} + (\mathbf{P}_1\tilde{\mathbf{z}}^{j_{i,2}})^T\mathbf{P}_1\tilde{\mathbf{z}}^{j_{i,2}} - 2(\mathbf{P}_2\tilde{\mathbf{z}}^{j_{i,1}})^T\mathbf{P}_2\tilde{\mathbf{z}}^{j_{i,2}} + 2\mathbf{p}_3^T\tilde{\mathbf{z}}^{j_{i,1}} + 2\mathbf{p}_3^T\tilde{\mathbf{z}}^{j_{i,2}} + f \big] \big\}_+.$   (8.14)

Given $\Omega = (\mathbf{W}, \Phi)$, the loss function in Eq. (8.7) can also be rewritten in the sample-based form:

$H(\Omega) = L(\Omega; Z) + \Psi(\Omega).$   (8.15)

The objective in Eq. (8.15) can be optimized by the minibatch backpropagation algorithm. Specifically, we update the parameters by gradient descent:

$\Omega \leftarrow \Omega - \alpha \frac{\partial}{\partial \Omega} H(\Omega),$   (8.16)

where $\alpha$ denotes the learning rate. The key problem in solving the above equation is calculating $\frac{\partial}{\partial \Omega} L(\Omega)$. As discussed in [32], there are two ways to achieve this, i.e., pair-based gradient descent and sample-based gradient descent. Here, we adopt the latter to reduce the computation and memory costs.
Suppose a minibatch of training samples $\{\mathbf{z}^{j_{1,x}}, ..., \mathbf{z}^{j_{n_x,x}}, \mathbf{z}^{j_{1,y}}, ..., \mathbf{z}^{j_{n_y,y}}\}$ is drawn from the original set $Z$, where $1 \le j_{i,x} \le M_x$ and $M_x + 1 \le j_{i,y} \le M_z$. Following the chain rule, calculating the gradient over all sample pairs is equivalent to summing up the gradient for each sample:

$\frac{\partial}{\partial \Omega} L(\Omega) = \sum_{j} \frac{\partial L}{\partial \tilde{\mathbf{z}}^j} \frac{\partial \tilde{\mathbf{z}}^j}{\partial \Omega},$   (8.17)

where $j$ can be either $j_{i,x}$ or $j_{i,y}$.


Using $\mathbf{z}^{j_{i,x}}$ as an example, we first introduce an indicator function $\mathbb{1}_{\mathbf{z}^{j_{i,x}}}(\mathbf{z}^{j_{i,y}})$ before calculating the partial derivative $\frac{\partial L}{\partial \tilde{\mathbf{z}}^{j_{i,x}}}$ of the output layer activation for each sample. Specifically, we define $\mathbb{1}_{\mathbf{z}^{j_{i,x}}}(\mathbf{z}^{j_{i,y}}) = 1$ when $\{\mathbf{z}^{j_{i,x}}, \mathbf{z}^{j_{i,y}}\}$ is a sample pair and $\ell_{j_{i,x}, j_{i,y}} \tilde{S}(\mathbf{z}^{j_{i,x}}, \mathbf{z}^{j_{i,y}}) < 1$; otherwise, we let $\mathbb{1}_{\mathbf{z}^{j_{i,x}}}(\mathbf{z}^{j_{i,y}}) = 0$. Here, $\ell_{j_{i,x}, j_{i,y}}$ indicates whether $\mathbf{z}^{j_{i,x}}$ and $\mathbf{z}^{j_{i,y}}$ are from the same class. With $\mathbb{1}_{\mathbf{z}^{j_{i,x}}}(\mathbf{z}^{j_{i,y}})$, the gradient with respect to $\tilde{\mathbf{z}}^{j_{i,x}}$ can be written as

$\frac{\partial L}{\partial \tilde{\mathbf{z}}^{j_{i,x}}} = -\sum_{j_{i,y}} 2\, \mathbb{1}_{\mathbf{z}^{j_{i,x}}}(\mathbf{z}^{j_{i,y}})\, \ell_{j_{i,x}, j_{i,y}} \big( \mathbf{P}_1^T\mathbf{P}_1\tilde{\mathbf{z}}^{j_{i,x}} - \mathbf{P}_2^T\mathbf{P}_2\tilde{\mathbf{z}}^{j_{i,y}} + \mathbf{p}_3 \big).$   (8.18)

The calculation of $\frac{\partial L}{\partial \tilde{\mathbf{z}}^{j_{i,y}}}$ can be conducted similarly. The algorithm for calculating the partial derivative of the output layer activation for each sample is shown in Algorithm 8.1.

Algorithm 8.1: Calculate the derivative of the output layer activation for each sample
Input:
The output layer activations of all samples
Output:
The partial derivatives of the output layer activations of all samples
1: for each sample $\mathbf{z}^j$ do
2:    Initialize the partner set $M_j$ of sample $\mathbf{z}^j$ as $M_j = \emptyset$;
3:    for each pair $\{\mathbf{x}_i, \mathbf{y}_i\}$ do
4:        if pair $\{\mathbf{x}_i, \mathbf{y}_i\}$ contains the sample $\mathbf{z}^j$ then
5:            if pair $\{\mathbf{x}_i, \mathbf{y}_i\}$ satisfies $\ell_i \tilde{S}(\mathbf{x}_i, \mathbf{y}_i) < 1$ then
6:                $M_j \leftarrow \{M_j$, the corresponding partner of $\mathbf{z}^j$ in $\{\mathbf{x}_i, \mathbf{y}_i\}\}$;
7:            end if
8:        end if
9:    end for
10:   Compute the derivatives for the sample $\mathbf{z}^j$ with all the partners in $M_j$, and sum these derivatives to obtain the desired partial derivative of sample $\mathbf{z}^j$'s output layer activation using Eq. (8.18);
11: end for
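The per-sample gradient of Eq. (8.18) can be sketched as follows; the list `partners` plays the role of the partner set $M_j$ constructed in Algorithm 8.1, and all names are illustrative.

```python
import numpy as np

def output_layer_gradient(z_tilde_x, partners, P1, P2, p3):
    """Eq. (8.18): gradient of the hinge loss w.r.t. one sample's output activation.

    `partners` is a list of (z_tilde_y, label) pairs for which the pair containing
    this sample violates its margin (the set M_j built by Algorithm 8.1).
    """
    grad = np.zeros_like(z_tilde_x)
    for z_tilde_y, label in partners:
        grad -= 2.0 * label * (P1.T @ (P1 @ z_tilde_x)
                               - P2.T @ (P2 @ z_tilde_y) + p3)
    return grad
```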

Note that all three subnetworks in our deep architecture are differentiable. We can easily use the backpropagation procedure [20] to compute the partial derivatives with respect to the hidden layers and the model parameters $\Omega$. We summarize the overall procedure of deep generalized similarity measure learning in Algorithm 8.2.
If all possible pairs are used in training, the sample-based form allows us to generate $n_x \times n_y$ sample pairs from a minibatch of $n_x + n_y$ samples. In contrast, the sample-pair-based form may require up to $2n_xn_y$ samples to generate $n_x \times n_y$ sample pairs. In the gradient computation of Eq. (8.18), for each sample, the sample-based form only requires calculating $\mathbf{P}_1^T\mathbf{P}_1\tilde{\mathbf{z}}^{j_{i,x}}$ once and $\mathbf{P}_2^T\mathbf{P}_2\tilde{\mathbf{z}}^{j_{i,y}}$ $n_y$ times, whereas in the sample-pair-based form, $\mathbf{P}_1^T\mathbf{P}_1\tilde{\mathbf{z}}^{j_{i,x}}$ and $\mathbf{P}_2^T\mathbf{P}_2\tilde{\mathbf{z}}^{j_{i,y}}$ should be computed $n_x$ and $n_y$ times, respectively. In sum, the sample-based form generally results in less computation and memory cost.

Algorithm 8.2: Generalized Similarity Learning
Input:
Training set, initialized parameters $\mathbf{W}$ and $\Phi$, learning rate $\alpha$, $t \leftarrow 0$
Output:
Network parameters $\mathbf{W}$ and $\Phi$
1: while $t \le T$ do
2:    Sample training pairs $D$;
3:    Feed the sampled images into the network;
4:    Perform a feed-forward pass for all the samples and compute the net activations for each sample $\mathbf{z}^i$;
5:    Compute the partial derivative of the output layer activation for each sample by Algorithm 8.1;
6:    Compute the partial derivatives of the hidden layer activations for each sample following the chain rule;
7:    Compute the desired gradients $\frac{\partial}{\partial \Omega} H(\Omega)$ using the backpropagation procedure;
8:    Update the parameters using Eq. (8.16) and set $t \leftarrow t + 1$;
9: end while

8.4 Experiments

Person reidentification, which aims to match pedestrian images across multiple nonoverlapping cameras, has attracted increasing attention in surveillance. Although considerable efforts have been made, it is still an open problem due to dramatic variations in viewpoint and pose. To evaluate this task, the CUHK03 [29] dataset
and CUHK01 [33] dataset are adopted in our experiments.
Results on CUHK03. We compare our approach with several state-of-the-art
methods, which can be grouped into three categories. First, we adopt five distance
metric learning methods based on fixed feature representation, i.e., information-
theoretic metric learning (ITML) [4], local distance metric learning (LDM) [34], large
margin nearest neighbors (LMNN) [35], learning-to-rank (RANK) [36], and kernel-
based metric learning (KML) [24]. Following their implementation, the handcrafted
features of dense color histograms and dense SIFT uniformly sampled from patches
are adopted. Second, three methods especially designed for person reidentification are
employed in the experiments: SDALF [37], KISSME [38], and eSDC [39]. Moreover,
several recently proposed deep learning methods, DRSCH [40], DFPNN [29] and
IDLA [25], are also compared with our approach. DRSCH [40] is a supervised
hashing framework for integrating CNN feature and hash code learning.

[Fig. 8.3: CMC curves (identification rate vs. rank), with rank-1 accuracies in the legends. (a) CUHK03: ITML 5.53%, SDALF 5.6%, Euclid 5.64%, LMNN 7.29%, eSDC 8.76%, RANK 10.42%, LDM 13.51%, KISSME 14.17%, DFPNN 20.65%, DRSCH 22.0%, KML 32.7%, IDLA 54.74%, Ours 58.4%. (b) CUHK01: SDALF 9.9%, Euclid 10.52%, ITML 17.10%, RANK 20.61%, LMNN 21.17%, eSDC 22.82%, LDM 26.45%, FPNN 27.87%, KISSME 29.40%, LMLF 34.30%, IDLA 65.00%, Ours 66.50%.]
Fig. 8.3 CMC curves on a the CUHK03 [29] dataset and b the CUHK01 [33] dataset for evaluating
person reidentification. Our method has superior performance compared to existing state-of-the-art
methods

The results are reported in Fig. 8.3a. It is encouraging that our approach signifi-
cantly outperforms the competing methods (e.g., improving the state-of-the-art rank-
1 accuracy from 54.74% (IDLA [25]) to 58.39%). Among the competing methods,
ITML [4], LDM [34], LMNN [35], RANK [36], KML [24], SDALF [37], KISSME
[38], and eSDC [39] are all based on handcrafted features. The superiority of our
approach in comparison to these methods should be attributed to the deployment of
both deep CNN features and the generalized similarity model. DRSCH [40], DFPNN
[29], and IDLA [25] adopt CNNs for feature representation, but their matching met-
rics are defined based on traditional linear transformations.
Results on CUHK01. Figure 8.3b shows the results of our method and of the
competing approaches on CUHK01. In addition to the methods used on CUHK03, an
additional method, i.e., LMLF [27], is used in the comparison experiment. LMLF [27]
learns midlevel filters from automatically discovered patch clusters. According to the
quantitative results, our method achieves a new state-of-the-art level of performance
with a rank-1 accuracy of 66.50%.

References

1. L. Lin, G. Wang, W. Zuo, X. Feng, L. Zhang, Cross-domain visual matching via generalized
similarity measure and feature learning, in IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 39, no. 6, pp. 1089–1102, 1 June 2017
2. D. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: an overview with
application to learning methods. Neural Comput. 16(12), 2639–2664 (2004)
3. A. Sharma, D.W. Jacobs, Bypassing synthesis: Pls for face recognition with pose, low-
resolution and sketch. Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 593–600 (2011)

4. J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in Proceedings of International Conference on Machine Learning (ACM, 2007), pp. 209–216
5. Y.T. Zhuang, Y.F. Wang, F. Wu, Y. Zhang, W. M. Lu, Supervised coupled dictionary learn-
ing with group structures for multi-modal retrieval, in Twenty-Seventh AAAI Conference on
Artificial Intelligence (2013)
6. S. Wang, D. Zhang, Y. Liang, Q. Pan, Semi-coupled dictionary learning with applications to
image super-resolution and photo-sketch synthesis, in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit, pp. 2216–2223 (2012)
7. L. Duan, D. Xu, I.W. Tsang, Domain adaptation from multiple sources: a domain-dependent
regularization approach. IEEE Trans. Neural Networks Learn. Syst. 23(3), 504–518 (2012)
8. D. Ramage, D. Hall, R. Nallapati, C. D. Manning, Labeled lda: a supervised topic model
for credit attribution in multi-labeled corpora, in Proc. Conf. Empirical Methods in Natural
Language Processing. Association for Computational Linguistics, pp. 248–256 (2009)
9. J. Zhu, A. Ahmed, E. P. Xing, Medlda: maximum margin supervised topic models for regression
and classification, in Proceedings of Int’l Conference on Machine Learning (ACM, 2009), pp.
1257–1264
10. A. Sharma, A. Kumar, H. Daume III, D.W. Jacobs, Generalized multiview analysis: a dis-
criminative latent space, in Proceedings IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2160–2167 (2012)
11. M. Kan, S. Shan, H. Zhang, S. Lao, X. Chen, Multi-view discriminant analysis, in Proceedings
European Conference on Computer (Springer, 2012), pp. 808–821
12. P. Luo, L. Lin, X. Liu, Learning compositional shape models of multiple distance metrics by information projection. IEEE Trans. Neural Networks Learn. Syst. (2015)
13. D. Chen, X. Cao, L. Wang, F. Wen, J. Sun, Bayesian face revisited: A joint formulation, in
European Conference on Computer Vision (Springer, 2012), pp. 566–579
14. Q. Cao, Y. Ying, P. Li, Similarity metric learning for face recognition, in Proceedings Int’l
Conference on Computer Vision (IEEE, 2013), pp. 2408–2415
15. Z. Li, S. Chang, F. Liang, T.S. Huang, L. Cao, J. R. Smith, Learning locally-adaptive decision
functions for person verification, in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 3610–3617 (2013)
16. H. Chang, D.-Y. Yeung, Locally smooth metric learning with application to image retrieval, in
Proceedings International Conference on Computer Vision (2007)
17. A. Mignon, F. Jurie, Cmml: a new metric learning approach for cross modal matching, in
Proceedings Asian Conference on Computer Vision (2012)
18. X. Zhai, Y. Peng, J. Xiao, Heterogeneous metric learning with joint graph regularization for
crossmedia retrieval, in Twenty-Seventh AAAI Conference on Artificial Intelligence, June 2013
19. C. Kang, S. Liao, Y. He, J. Wang, S. Xiang, C. Pan, Cross-modal similarity learning: a low
rank bilinear formulation. Arxiv, arXiv:1411.4738 (2014)
20. Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel,
Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551
(1989)
21. A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional
neural networks, in Advances in neural information processing systems, pp. 1097–1105 (2012)
22. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation,
arXiv preprint arXiv:1411.4038 (2014)
23. Y. Sun, Y. Chen, X. Wang, and X. Tang, Deep learning face representation by joint identification-
verification, in Advances in Neural Information Processing Systems, pp. 1988–1996 (2014)
24. F. Xiong, M. Gou, O. Camps, M. Sznaier, Person re-identification using kernel-based metric
learning methods, in Proceedings European Conference Computer Vision (Springer, 2014), pp.
1–16
25. E. Ahmed, M. Jones, T. K. Marks, An improved deep learning architecture for person re-
identification, in Proceedings IEEE Conference on Computer Vision and Pattern Recognition
(IEEE, 2015)

26. S.-Z. Chen, C.-C. Guo, J.-H. Lai, Deep ranking for person re-identification via joint represen-
tation learning. Arxiv, arXiv:1505.06821 (2015)
27. R. Zhao, W. Ouyang, X. Wang, Learning mid-level filters for person re-identification, in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 144–151
(2014)
28. G. Andrew, R. Arora, J. Bilmes, K. Livescu, Deep canonical correlation analysis, in Proceedings
IEEE the 30th Int’l Conference Machine Learning, pp. 1247–1255 (2013)
29. W. Li, R. Zhao, T. Xiao, X. Wang, Deepreid: deep filter pairing neural network for person
re-identification, in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 152–159 (2014)
30. J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, Y. Wu, Learning fine-
grained image similarity with deep ranking, in Proceedings IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1386–1393 (2014)
31. D. Yi, Z. Lei, and S. Z. Li, Deep metric learning for practical person re-identification, arXiv
preprint arXiv:1407.4979 (2014)
32. S. Ding, L. Lin, G. Wang, H. Chao, Deep feature learning with relative distance comparison
for person re-identification. Pattern Recogn. 48(10), 2993–3003 (2015)
33. W. Li, R. Zhao, X. Wang, Human reidentification with transferred metric learning. In Proceed-
ings Asian Conference on Computer Vision, pp. 31–44 (2012)
34. M. Guillaumin, J. Verbeek, C. Schmid, Is that you? metric learning approaches for face identifi-
cation, in ICCV 2009-International Conference on Computer Vision (IEEE, 2009), pp. 498–505
35. K.Q. Weinberger, J. Blitzer, L.K. Saul, Distance metric learning for large margin nearest neigh-
bor classification, in Advances in Neural Information Processing Systems, pp. 1473–1480
(2005)
36. B. McFee, G.R. Lanckriet, Metric learning to rank, in Proc. Int’l Conference on Computer
Learning, pp. 775–782 (2010)
37. M. Farenzena, L. Bazzani, A. Perina, V. Murino, M. Cristani, Person re-identification by
symmetry-driven accumulation of local features, in Computer Vision and Pattern Recogni-
tion (CVPR), 2010 IEEE Conference on (IEEE, 2010), pp. 2360–2367
38. M. Kostinger, M. Hirzer, P. Wohlhart, P.M. Roth, H. Bischof, Large scale metric learning from
equivalence constraints, in Proceedings IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2288–2295 (2012)
39. R. Zhao, W. Ouyang, X. Wang, Unsupervised salience learning for person re-identification, in
CVPR (2013)
40. R. Zhang, L. Lin, R. Zhang, W. Zuo, L. Zhang, Bit-scalable deep hashing with regularized
similarity learning for image retrieval and person re-identification. IEEE Trans. Image Process.
24(12), 4766–4779 (2015)
Chapter 9
Face Verification

Abstract This chapter introduces a novel cost-effective framework for face identifi-
cation that progressively maintains a batch of classifiers with an increasing number of
facial images of different individuals. By naturally combining two recently emerging
techniques, active learning (AL) and self-paced learning (SPL), the proposed frame-
work is capable of automatically annotating new instances and incorporating them
into the training with weak expert recertification. The advantages of this proposed
framework are twofold: (i) the required number of annotated samples is significantly
decreased, while comparable performance is guaranteed, and user effort is also dra-
matically reduced compared to other state-of-the-art active learning techniques, and
(ii) the mixture of SPL and AL effectively improves not only the classifier accuracy
but also the robustness against noisy data compared to the existing AL/SPL methods
(© [2019] IEEE. Reprinted, with permission, from [1]).

9.1 Introduction

With the proliferation of mobile phones, cameras, and social networks, a large number
of photographs, especially those containing people's faces, have been created. To
interact with these photos, there have been increasing demands for intelligent sys-
tems (e.g., content-based personal photo search-and-share applications using mobile
albums or social networks) with face recognition techniques [2–4]. Owing to several
recently proposed pose/expression normalization and alignment-free approaches [5–
7], identifying faces in the wild has achieved remarkable progress. Regarding com-
mercial products, the website “Face.com” has provided an application programming
interface (API) to automatically detect and recognize faces in photos. The main problem in
such scenarios is identifying individuals from images in a relatively unconstrained
environment. Traditional methods usually address this problem by supervised learn-
ing [8], and it is typically expensive and time consuming to prepare a good set of
labeled samples. When only a few samples are labeled, semisupervised learning [9]
may be a good candidate for solving this problem. However, [10] notes that due
to large numbers of noisy samples and outliers, directly using unlabeled data may
significantly reduce the learning performance.

Conventional incremental face recognition methods such as incremental subspace
approaches [11, 12] often fail in complex and large-scale environments. Their per-
formances can be drastically reduced when the initial training set of facial images is
either insufficient or inappropriate. In addition, most existing incremental approaches
suffer from noisy samples or outliers in model updating. In this work, to address the
above difficulties, we propose a novel active self-paced learning framework (ASPL)
that absorbs the powers of two recently emerging techniques: active learning (AL)
[13, 14] and self-paced learning (SPL) [15–17]. In particular, our framework works
in a “cost-less-earn-more” manner, pursuing as high a performance as possible
while reducing costs.
The basic approach of the AL method is to progressively select and annotate the
most informative unlabeled samples to improve the model; in this step, user interac-
tion is allowed. The key to AL is the sample selection criteria, which are typically
defined according to the classification uncertainty of the samples. Specifically, the
samples of low classification confidence, together with other informative criteria,
such as diversity, are generally treated as good candidates for model retraining. On
the other hand, SPL is a recently proposed learning regime that mimics the learning
process of humans/animals by gradually incorporating easy to more complex samples
into training [18, 19]; an easy sample is actually the sample with high classification
confidence in the currently trained model. Interestingly, the two learning methods
select samples with the opposite criteria. This finding inspires us to investigate the
connection between the two learning regimes and the possibility of making them
complement each other. Moreover, as noted in [4, 20], learning-based features are
able to exploit information with better discriminative ability for face recognition than
handcrafted features. We thus utilize the deep convolutional neural network (CNN)
[21, 22] for feature extraction instead of handcrafted image features. In sum, we
aim to design a cost-effective and progressive learning framework that is capable of
automatically annotating new instances and incorporating them into training with
weak expert recertification. Below, we discuss the advantages of two aspects of our
ASPL framework: “cost less” and “earn more”.
(I) Cost less: Our framework is capable of building effective classifiers with fewer
labeled training instances and less user effort than other state-of-the-art algorithms.
This level of performance is achieved by combining active learning and self-paced
learning in the incremental learning process. In certain feature spaces of model train-
ing, samples of low classification confidence are scattered and close to the classifier
decision boundary, while high-confidence samples are distributed compactly in the
intraclass regions. Our approach takes both categories of sample into consideration
for classifier updating. The benefits of this strategy are as follows: (i) High-confidence
samples can be automatically labeled and consistently added to the model training
throughout the learning process in a self-paced fashion, particularly when the clas-
sifier becomes increasingly reliable in later learning iterations. This significantly
reduces the burden of user annotations and makes the method scalable in large-scale
scenarios. (ii) Low-confidence samples are selected by allowing active user anno-
tations, enabling our approach to more efficiently select informative samples, adapt
better to practical variations and converge faster, especially in the early learning stage
of training.
(II) Earn more: The mixture of self-paced learning and active learning effectively
improves not only the classifier accuracy but also the classifier robustness against
noisy samples. From the perspective of AL, extra-high-confidence samples are auto-
matically incorporated into the retraining in each iteration without human labor
costs, thus gaining faster convergence. Introducing these high-confidence samples
also contributes to suppressing noisy samples in learning due to their compactness
and consistency in the feature space. From the SPL perspective, allowing active user
intervention generates reliable and diverse samples that can avoid learning being
misled by outliers. In addition, utilizing the CNN facilitates the pursuit of higher
classification performance by learning convolutional filters instead of handcrafted
feature engineering.
In brief, our ASPL framework includes two main phases. In the initial stage,
we first learn a general face representation using a convolutional neural network
architecture and train a batch of classifiers with a very small set of annotated samples
of different individuals. In the iteration learning stage, we rank the unlabeled samples
according to how they relate to the current classifiers and retrain the classifiers by
selecting and annotating samples in either an active user query or a self-paced manner.
We can also fine-tune the CNN based on the updated classifiers.

9.2 Related Work

In this section, we first present a review of incremental face recognition and then
briefly introduce related developments in active learning and self-paced learning.
Incremental Face Recognition. There are two categories of methods address-
ing the problem of identifying faces with incremental data, namely, incremental
subspace and incremental classifier methods. The first category mainly includes
incremental versions of traditional subspace learning approaches, such as principal
component analysis (PCA) [23] and linear discriminant analysis (LDA) [12]. These
approaches map facial features into a subspace and keep the eigen representations
(i.e., eigen faces) up-to-date by incrementally incorporating new samples. In addi-
tion, face recognition is commonly accomplished by nearest neighbor-based feature
matching, which is computationally expensive when a large number of samples are
accumulated over time. On the other hand, the incremental classifier methods tar-
get updating the prediction boundary with the learned model parameters and new
samples. Exemplars include incremental support vector machines (ISVM) [24] and
online sequential forward neural networks [25]. In addition, several attempts have
been made to absorb advantages from both categories of methods. For example,
Ozawa et al. [26] proposed integrating incremental PCA with the resource alloca-
tion network in an iterative way. Although these approaches have made remarkable
progress, they suffer from low accuracy compared with batch-based state-of-the-art
face recognizers, and none of these approaches have been successfully validated on
large-scale datasets (e.g., more than 500 individuals). These approaches are basically
studied in the context of fully supervised learning; i.e., both initial and incremental
data have to be labeled.
Active Learning. This branch of research focuses mainly on actively selecting
and annotating the most informative unlabeled samples to avoid unnecessary and
redundant annotation. The key part of active learning is thus the selection strategy,
i.e., which samples should be presented to the user for annotation. One of the most
common strategies is certainty-based selection [27, 28], in which the certainties are
measured according to the predictions on new unlabeled samples obtained from the
initial classifiers. For example, Lewis et al. [27] propose taking the most uncertain
instance as the one that has the largest entropy on the conditional distribution over
its predicted labels. Several SVM-based methods [28] determine uncertain sam-
ples as those that are relatively close to the decision boundary. Sample certainty is
also measured by applying a committee of classifiers in [29]. These certainty-based
approaches usually ignore the large set of unlabeled instances and are thus sensitive
to outliers. A number of later methods present the information density measure by
exploiting the unlabeled data information when selecting samples. For example, the
informative samples are sequentially selected to minimize the generalization error
of the trained classifier on the unlabeled data based on a statistical approach [30]
or prior information [31]. In [32, 33], instances are taken to maximize the increase
of mutual information between the candidate instances and the remaining instances
based on Gaussian process models. The diversity of the selected instance over the
unlabeled data has also been taken into consideration [34]. Recently, Elhamifar
et al. [13] present a general framework via convex programming that considers both
the uncertainty and diversity measures for sample selection. However, these active
learning approaches usually emphasize low-confidence samples (e.g., uncertain or
diverse samples) while ignoring the other majority of high-confidence samples. To
enhance the discriminative capability, Wang et al. [9] propose a unified semisuper-
vised learning framework, which incorporates the high-confidence coding vectors
of unlabeled data into the training under the proposed effective iterative algorithm
and demonstrates its effectiveness in dictionary-based classification. Our work is
inspired by this study and also employs high-confidence samples to improve both
the accuracy and the robustness of classifiers.
Self-paced Learning. Inspired by the cognitive principle of humans/animals,
Bengio et al. [18] introduce the concept of curriculum learning (CL), in which a
model is learned by gradually including samples in training in a sequence from easy to
complex. To make this concept more implementable, Kumar et al. [19] substantially
implement this learning philosophy by formulating the CL principle as a concise
optimization model named self-paced learning (SPL). The SPL model includes a
weighted loss term on all samples and a general SPL regularizer imposed on sample
weights. By sequentially optimizing the model with a gradually increasing pace
parameter on the SPL regularizer, more samples can be automatically discovered
in a purely self-paced way. Jiang et al. [15, 16] provide a more comprehensive
understanding of the learning insight underlying SPL/CL and formulate the learning
model as a general optimization problem as follows:

$$\min_{\mathbf{w},\ \mathbf{v}\in[0,1]^n}\ \sum_{i=1}^{n} v_i\, L(\mathbf{w}; x_i, y_i) + f(\mathbf{v}; \lambda), \qquad \text{s.t.}\ \mathbf{v} \in \Psi, \tag{9.1}$$

where D = {(x_i, y_i)}_{i=1}^n corresponds to the training dataset, L(w; x_i, y_i) denotes
the loss function, which calculates the cost between the objective label y_i and the
estimated label, w represents the model parameter inside the decision function, and
v = [v_1, v_2, . . . , v_n]^T denotes the weight variables reflecting the samples' importance.
λ is a parameter for controlling the learning pace, which is also referred to as “pace
age”.
In the model, f (v; λ) corresponds to a self-paced regularizer. Jiang et al. abstract
three necessary conditions that should be satisfied [15, 16]: (1) f (v; λ) is convex with
respect to v ∈ [0, 1]; (2) the optimal weight of each sample should monotonically
decrease with respect to the corresponding loss; and (3) the optimal weight of each
sample should monotonically increase with respect to the pace parameter λ.
In this axiomatic definition, Condition 2 indicates that the model is inclined to
select easy samples (with smaller errors) rather than complex samples (with larger
errors). Condition 3 states that when the model “age” λ becomes larger, the model
embarks on incorporating more, probably complex, samples to train a “mature”
model. The convexity in Condition 1 further ensures that the model can find good
solutions.
Ψ is the so-called curriculum region that encodes the information of predetermined
curricula. Its axiomatic definition contains two conditions [15]: (1) it should be
nonempty and convex; and (2) if x_i ranks before x_j (i.e., is more important for the problem)
in the curriculum, the expectation ∫_Ψ v_i dv should be larger than ∫_Ψ v_j dv. Condition
1 ensures the soundness of the calculation of this specific constraint, and Condition
2 indicates that samples to be learned earlier are supposed to have larger expected
values. This constraint weakly implies a prior learning sequence of samples in which
the expected value of the favored samples should be larger.
The SPL model (9.1) finely simulates the learning process of human education.
Specifically, it builds an “instructor-student collaboration” paradigm, which on the
one hand utilizes prior knowledge provided by instructors as a guide for the cur-
riculum design (encoded by the curriculum constraint) and on the other hand leaves
certain freedom to the students to adapt the actual curriculum according to their learn-
ing pace (encoded by the self-paced regularizer). Such a model not only includes all
previous SPL/CL methods as its special cases but also provides a general guideline
to extend a rational SPL implementation scheme to certain learning tasks. Based on
this framework, multiple SPL variations have recently been proposed, such as SPaR
[16], SPLD [17], SPMF [35], and SPCL [15].
SPL-related strategies have also been recently attempted in a series of applications,
such as specific-class segmentation learning [36], visual category discovery [37],
long-term tracking [38], action recognition [17], and background subtraction [35]. In
particular, the SPaR method, constructed based on the general formulation (9.1), was
applied to the challenging SQ/000Ex task of the TRECVID MED/MER competition
and achieved the leading performance among all competing teams [39].
Complementarity between AL and SPL: It is interesting that the function of
SPL is very complementary to that of AL. The SPL method emphasizes easy samples
in learning, which corresponds to the high-confidence intraclass samples, while AL
tends to select the most uncertain and informative samples, which are always located
in low-confidence areas near classification boundaries, for the learning task. SPL
is capable of easily attaining a large number of faithfully pseudo-labeled samples
with a lower human labor requirement (by the reranking technique [16], for which
we will provide details in Sect. 9.4), but it tends to underestimate the roles of the
most informative samples in intrinsically configuring the classification boundaries; in
contrast, AL tends to obtain informative samples, but more human labor is needed
for the careful manual annotation of these samples. We thus expect to effectively mix
these two learning schemes to help incremental learning both improve efficiency with
less human labor (i.e., cost less) and achieve better accuracy and robustness of the
learned classifier against noisy samples (i.e., earn more). This is the basic motivation
of our ASPL framework for face identification in large-scale scenarios.

9.3 Framework Overview

In this section, we illustrate how our ASPL model works. As illustrated in Fig. 9.1,
the main stages of our framework pipeline are CNN pretraining for face representa-
tion, classifier updating, self-paced high-confidence sample pseudo-labeling, low-
confidence sample annotation by active users, and CNN fine-tuning.
CNN pretraining: Before running the ASPL framework, we need to pretrain a
CNN for feature extraction on a given face dataset. These extra images are selected
without any overlap with our experimental data. Because several publicly available
CNN architectures [40, 41] have achieved remarkable success in visual recognition,

Fig. 9.1 Illustration of our proposed cost-effective framework. The pipeline includes stages of
CNN and model initialization; classifier updating; high-confidence sample labeling by the SPL and
low-confidence sample annotation by AL; and CNN fine-tuning, where the arrows represent the
workflow. The images highlighted in blue in the left panel represent the initially selected samples
our framework supports directly employing these architectures and their pretrained
model as initialized parameters. In our experiments, AlexNet [40] is utilized. Given
the selection of extra annotated samples, we further fine-tune the CNN to learn
discriminative feature representation.
Initialization: At the beginning, we randomly select a few images for each indi-
vidual, extract feature representation for them by the pretrained CNN, and manually
annotate labels for them as the starting point.
Classifier updating: In our ASPL framework, we use one-versus-all linear SVM
as our classifier updating strategy. In the beginning, only a small portion of the
samples are labeled, and we train an initial classifier for every individual using these
samples. As the framework matures, samples manually annotated by the AL and
pseudo-labeled by the SPL increase, and we adopt them to retrain the classifiers.
High-confidence sample pseudo-labeling: We rank the unlabeled samples by their
importance weights via the current classifiers, e.g., using the classification prediction
hinge loss, and then assign pseudo-labels to the top-ranked high-confidence samples.
This step can be automatically implemented by our system.
Low-confidence sample annotation: Based on certain AL criteria obtained under
the current classifiers, all unlabeled samples are ranked; then, the top-ranked samples
(most informative and generally with low confidence) are selected from the unlabeled
samples and manually annotated by active users.
CNN fine-tuning: After several steps of the interaction, we fine-tune the neural
network by the backpropagation algorithm. All samples self-labeled by the SPL and
manually annotated by the AL are fed into the network, and we utilize the softmax
loss to optimize the CNN parameters via a stochastic gradient descent approach.

9.4 Formulation and Optimization

In this section, we will discuss the formulation of our proposed framework and
provide a theoretical interpretation of the entire pipeline from the perspective of
optimization. Specifically, we can theoretically justify the entire pipeline of this
framework because it is in fine accordance with a solving process for an active self-
paced learning (ASPL) optimization model. Such a theoretical understanding will
help deliver a more insightful understanding of the intrinsic mechanism underlying
the ASPL system.
In the context of face identification, suppose that we have n facial photos of
m subjects. Denote the training samples as D = {x_i}_{i=1}^n ⊂ R^d, where x_i is the d-
dimensional feature representation of the ith sample. We have m classifiers for rec-
ognizing each sample by the one-versus-all strategy.
Knowledge learned from the data will be utilized to ameliorate our model after
a period of pace increase. Correspondingly, we denote the label set of x_i as y_i =
{y_i^(j) ∈ {−1, 1}}_{j=1}^m, where y_i^(j) corresponds to the label of x_i for the jth subject.
That is, if y_i^(j) = 1, then x_i is categorized as a face from the jth subject.
In our problem setting, we should make two necessary remarks. First, in our
investigated face identification problems, almost no data are labeled before running
our system. Only a very small number of samples are annotated as the initialization.
That is, most {y_i}_{i=1}^n are unknown and need to be completed in the learning process.
In our system, a minority of the samples are manually annotated by active users, and
a majority are pseudo-labeled in a self-paced manner. Second, the data {x_i}_{i=1}^n could
possibly be input into the system incrementally, meaning that the data scale might
be consistently growing.
Via the proposed mechanism of combining SPL and AL, our proposed ASPL
model can adaptively address both manually annotated and pseudo-labeled samples
and still progressively fit the consistently growing unlabeled data incrementally. The
ASPL is formulated as follows:


$$\min_{\mathbf{w},\,\mathbf{b},\,\mathbf{v},\; y_i\in\{-1,1\}^m,\ i\notin\Omega^{\lambda}}\ \sum_{j=1}^{m}\left\{\frac{1}{2}\big\|w^{(j)}\big\|_2^2 + C\cdot L\big(w^{(j)}, b^{(j)}, D, y^{(j)}, v^{(j)}\big) + f\big(v^{(j)};\lambda_j\big)\right\}, \quad \text{s.t.}\ \mathbf{v}\in\Psi^{\lambda}, \tag{9.2}$$

where w = {w^(j)}_{j=1}^m ⊂ R^d and b = {b^(j)}_{j=1}^m ⊂ R represent the weight and bias
parameters of the decision functions for all m classifiers. C (C > 0) is the standard
regularization parameter trading off the loss function and the margin, and we set C =
1 in our experiments. v = {[v_1^(j), v_2^(j), . . . , v_n^(j)]^T}_{j=1}^m denotes the weight variables
reflecting the training samples' importance, and λ_j is a parameter (i.e., the pace age)
for controlling the learning pace of the jth classifier. f(v^(j); λ_j) is the self-paced
regularizer controlling the learning scheme. We denote the index collection of all
currently active annotated samples as Ω^λ = ∪_{j=1}^m {Ω^{λ_j}}, where Ω^{λ_j} corresponds to
the set of the jth subject with the pace age λ_j. Here, Ω^λ is introduced as a constraint
on y_i. Ψ^λ = ∩_{i=1}^n {Ψ_i^λ} composes the curriculum constraint of the model at the m
classifier pace ages λ = {λ_j}_{j=1}^m. In particular, we specify two alternative types of
curriculum constraint for each sample x_i, as follows:
• Ψ_i^λ = [0, 1] for a pseudo-labeled sample, i.e., i ∉ Ω^λ. Then, its weights with
respect to all the classifiers {v_i^(j)}_{j=1}^m need to be learned in the SPL optimization.
• Ψ_i^λ = {1} for a sample annotated by the AL process, i.e., ∃ j s.t. i ∈ Ω^{λ_j}. Thus,
its weights are deterministically set during the model training, i.e., v_i^(j) = 1.
Each type of curriculum will be interpreted in detail in Sect. 9.2. Note that in
contrast to the previous SPL settings, this curriculum Ψ_i^λ can be dynamically changed
with respect to all the pace ages λ of the m classifiers. This confirms the superiority of
our model, as we discuss at the end of this section.
We then define the loss function L(w^(j), b^(j), D, y^(j), v^(j)) on x as
$$
\begin{aligned}
L\big(w^{(j)}, b^{(j)}, D, y^{(j)}, v^{(j)}\big) &= \sum_{i=1}^{n} v_i^{(j)}\, \ell\big(w^{(j)}, b^{(j)}; x_i, y_i^{(j)}\big) \\
&= \sum_{i=1}^{n} v_i^{(j)} \Big[1 - y_i^{(j)}\big(w^{(j)T} x_i + b^{(j)}\big)\Big]_{+}
\end{aligned}
\tag{9.3}
$$
$$\text{s.t.}\quad \sum_{j=1}^{m} \big|y_i^{(j)} + 1\big| \le 2, \qquad y_i^{(j)} \in \{-1, 1\}, \quad i \notin \Omega^{\lambda},$$

where [1 − y_i^(j)(w^(j)T x_i + b^(j))]_+ is the hinge loss of x_i in the jth classifier. The
cost term corresponds to the summarized loss of all classifiers, and the constraint
term allows only two types of feasible solutions: (i) there exists exactly one j with y_i^(j) = 1,
while y_i^(k) = −1 for all k ≠ j; (ii) y_i^(j) = −1 for all j = 1, 2, . . . , m
(i.e., background or an unknown person class). These samples x_i are added to the
unknown sample set U. Clearly, such a constraint complies with real-life cases in
which a sample should be categorized in one prespecified subject or not classified in
any of the current subjects.
Referring to the well-known alternating search strategy, we can then solve this optimiza-
tion problem. Specifically, the algorithm is designed by alternately updating the
classifier parameters w, b via one-versus-all SVM, the sample importance weights v
via the SPL, and the pseudo-labels y via reranking. In addition to gradually increas-
ing the pace parameter λ, the optimization updates (i) the curriculum constraint Ψ^λ
via AL and (ii) the feature representation via fine-tuning the CNN. In the following
section, we introduce the details of these optimization steps and their physical inter-
pretations. The correspondence of this algorithm to the practical implementation of
the ASPL system will also be discussed at the end.
Initialization: As introduced in the framework, we initialize our system
by using a pretrained CNN to extract feature representations of all samples {x_i}_{i=1}^n.
Set the initial pace parameters of the m classifiers λ = {λ_j}_{j=1}^m. Initialize the curriculum con-
straint Ψ^λ with the currently user-annotated samples Ω^λ and the corresponding {y^(j)}_{j=1}^m
and v.
Classifier Updating: This step aims to update the classifier parameters {w^(j),
b^(j)}_{j=1}^m by one-versus-all SVM. With {{x_i}_{i=1}^n, v, {y_i}_{i=1}^n, Ψ^λ} fixed, the original ASPL
model Eq. (9.2) can be simplified into the following form:

$$\min_{\mathbf{w},\mathbf{b}}\ \sum_{j=1}^{m}\left\{\frac{1}{2}\big\|w^{(j)}\big\|_2^2 + C\sum_{i=1}^{n} v_i^{(j)}\,\ell\big(w^{(j)}, b^{(j)}; x_i, y_i^{(j)}\big)\right\},$$

which can be equivalently reformulated to solve the following independent subopti-
mization problem for each classifier j = 1, 2, . . . , m:
$$\min_{w^{(j)},\, b^{(j)}}\ \frac{1}{2}\big\|w^{(j)}\big\|_2^2 + C\sum_{i=1}^{n} v_i^{(j)}\,\ell\big(w^{(j)}, b^{(j)}; x_i, y_i^{(j)}\big). \tag{9.4}$$

This is a standard weighted one-versus-all SVM model that takes the samples of one
class as positive and all others as negative. Specifically, when the weights v_i^(j)
take only values in {0, 1}, the model corresponds to a standard SVM trained on
the sampled instances with v_i^(j) = 1; otherwise, when v_i^(j) takes values in [0, 1], it
corresponds to a weighted SVM model. Both models can readily be solved by many
off-the-shelf efficient solvers. Thus, this step can be interpreted as implementing one-
versus-all SVM over the manually annotated instances from the AL and the self-annotated
instances from the SPL.
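As an illustration of this step, the following is a minimal sketch (not the authors' code) of the weighted one-versus-all update in Eq. (9.4) using scikit-learn's LinearSVC, whose fit method accepts per-sample weights; the array shapes and the assumption that every subject has both positive and negative samples among the kept instances are for illustration only.

```python
from sklearn.svm import LinearSVC

def update_classifiers(X, Y, V, C=1.0):
    """Weighted one-versus-all update of Eq. (9.4).
    X: (n, d) feature matrix; Y: (n, m) labels in {-1, +1}; V: (n, m) weights in [0, 1].
    Assumes each column of Y contains both classes among the kept samples."""
    m = Y.shape[1]
    classifiers = []
    for j in range(m):
        keep = V[:, j] > 0                 # zero-weight samples do not influence the loss
        clf = LinearSVC(C=C, loss="hinge")
        clf.fit(X[keep], Y[keep, j], sample_weight=V[keep, j])
        classifiers.append(clf)
    return classifiers
```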
High-confidence Sample Labeling: This step aims to assign pseudo-labels y and
the corresponding importance weights v to the top-ranked high-confidence samples.
We start by employing the SPL to rank the unlabeled samples according to their
weights v. Under fixed {w, b, {x_i}_{i=1}^n, {y_i}_{i=1}^n, Ψ^λ}, our ASPL model in Eq. (9.2) can
be simplified to optimize v as

$$\min_{\mathbf{v}\in[0,1]}\ \sum_{j=1}^{m}\left\{ C\sum_{i=1}^{n} v_i^{(j)}\,\ell\big(w^{(j)}, b^{(j)}; x_i, y_i^{(j)}\big) + f\big(v^{(j)};\lambda_j\big)\right\}, \qquad \text{s.t.}\ \mathbf{v}\in\Psi^{\lambda}. \tag{9.5}$$

The problem then degenerates to a standard SPL problem, as in Eq. (9.1). Because
both the self-paced regularizer f(v^(j); λ_j) and the curriculum constraint Ψ^λ are
convex (with respect to v), various existing convex optimization techniques, such
as gradient-based or interior-point methods, can be used to solve it. Note that we
have multiple choices for the self-paced regularizer, such as those built in [16, 17].
All of them comply with the three axiomatic conditions required for a self-paced
regularizer, as defined in Sect. 9.2.
Based on the second axiomatic condition for a self-paced regularizer, any of the
above f(v^(j); λ_j) tends to assign larger weights to high-confidence (i.e., easy)
samples with smaller loss values and vice versa, which evidently facilitates the model
with the “learning from easy to hard” insight. In all our experiments, we utilize the
linear soft weighting regularizer due to its relatively easy implementation and good
adaptability to complex scenarios. This regularizer penalizes the sample weights
linearly in terms of the loss. Specifically, we have

$$f\big(v^{(j)}; \lambda_j\big) = \lambda_j \left( \frac{1}{2}\big\|v^{(j)}\big\|_2^2 - \sum_{i=1}^{n} v_i^{(j)} \right), \tag{9.6}$$

where λ_j > 0. Equation (9.6) is convex with respect to v^(j), and we can, therefore,
search for its global optimum by setting the partial gradient to zero. Consider-
ing v_i^(j) ∈ [0, 1], we deduce the analytical solution for the linear soft weighting as
follows:
$$v_i^{(j)} = \begin{cases} -\dfrac{C\,\ell_{ij}}{\lambda_j} + 1, & C\,\ell_{ij} < \lambda_j \\ 0, & \text{otherwise}, \end{cases} \tag{9.7}$$

where ℓ_{ij} = ℓ(w^(j), b^(j); x_i, y_i^(j)) is the loss of x_i in the jth classifier. Note that the
way to deduce Eq. (9.7) is similar to the way used in [16], but the resulting solution
is different because our ASPL model in Eq. (9.2) is new.
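As a small illustration, the closed-form solution of Eq. (9.7) amounts to a few lines of NumPy; the following sketch uses the notation above and is not the authors' code.

```python
import numpy as np

def spl_weights(hinge_losses, lam, C=1.0):
    """Linear soft-weighting of Eq. (9.7): hinge_losses holds l_ij for one classifier j,
    lam is the pace parameter lambda_j."""
    scaled = C * np.asarray(hinge_losses, dtype=float)
    return np.where(scaled < lam, 1.0 - scaled / lam, 0.0)

# Example: easy samples (small loss) get weights close to 1, harder ones get 0.
print(spl_weights([0.0, 0.5, 2.0], lam=1.0))   # -> [1.  0.5 0. ]
```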
After obtaining the weights v for all unlabeled samples (i ∉ Ω^λ), we rank them according to
the optimized v^(j) in descending order and consider the samples with the largest weights as
high-confidence samples. We form these samples into a high-confidence sample set
S and assign them pseudo-labels: fixing {w, b, {x_i}_{i=1}^n, Ψ^λ, v}, we optimize y_i in
Eq. (9.2), which corresponds to solving


$$\min_{y_i \in \{-1,1\}^m,\ i \in S}\ \sum_{i=1}^{n}\sum_{j=1}^{m} v_i^{(j)}\,\ell_{ij}, \qquad \text{s.t.}\ \sum_{j=1}^{m}\big|y_i^{(j)} + 1\big| \le 2, \tag{9.8}$$

where v_i is fixed and can be treated as a constant. When x_i belongs to a certain person
class, Eq. (9.8) has an optimum that can be obtained exactly by Theorem 1. The
proof is specified in the supplementary material.
Those j's that satisfy w^(j)T x_i + b^(j) ≠ 0 and v_i^(j) ∈ (0, 1] are denoted as a set M,
and we set y_i^(j) = −1 for all other j by default. The solution of Eq. (9.8) for y_i^(j), j ∈ M,
can be obtained by the following theorem.

Theorem 1 (a) If ∀ j ∈ M, w^(j)T x_i + b^(j) < 0, Eq. (9.8) has the solution

$$y_i^{(j)} = -1, \quad j = 1, \ldots, m;$$

(b) when w^(j)T x_i + b^(j) < 0 for all j ∈ M except j = j*, i.e., v_i^(j*) ℓ_{ij*} > 0, Eq. (9.8) has the solution

$$y_i^{(j)} = \begin{cases} -1, & j \neq j^* \\ \;\;\,1, & j = j^*; \end{cases}$$

(c) otherwise, Eq. (9.8) has the solution

$$y_i^{(j)} = \begin{cases} -1, & j \neq j^* \\ \;\;\,1, & j = j^*, \end{cases}$$

where

$$j^* = \arg\min_{1 \le j \le m}\ v_i^{(j)}\Big(\ell_{ij} - \big[1 + \big(w^{(j)T} x_i + b^{(j)}\big)\big]_{+}\Big). \tag{9.9}$$
In fact, only the high-confidence samples with positive weights, as calculated in
the last update step for v, are meaningful for the solution. This implies the physi-
cal interpretation of this optimization step: We iteratively find the high-confidence
samples based on the current classifier and then enforce pseudo-labels yi on those
top-ranked high-confidence samples (i ∈ S). This is exactly the mechanism under-
lying a reranking technique [16].
The above optimization process can be understood as the self-learning manner
of a student. The student tends to select the most high-confidence samples, which
convey the easiest and most reliable knowledge underlying the data, to learn under
the regularization of the predesigned curriculum Ψ^λ. Such regularization tends to
safeguard the student's learning process by preventing him/her from being trapped at an
unexpected overfitting point.
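For illustration, a rough NumPy sketch of pseudo-labeling one high-confidence sample according to Theorem 1 is given below. It is not the authors' code; cases (b) and (c) are collapsed into the single selection rule of Eq. (9.9), and the candidate set M follows the reconstruction above.

```python
import numpy as np

def pseudo_label(scores, v_i, losses):
    """scores: (m,) decision values w^(j)T x_i + b^(j); v_i: (m,) SPL weights of x_i;
    losses: (m,) hinge losses l_ij. Returns labels in {-1, +1} with at most one +1."""
    scores, v_i, losses = map(np.asarray, (scores, v_i, losses))
    y = -np.ones(scores.shape)
    active = (scores != 0) & (v_i > 0)              # the candidate set M of Theorem 1
    if not np.any(active & (scores > 0)):           # case (a): no classifier claims x_i
        return y
    # a single positive label j* chosen by Eq. (9.9), restricted to the candidate set
    crit = v_i * (losses - np.maximum(1.0 + scores, 0.0))
    j_star = int(np.argmin(np.where(active, crit, np.inf)))
    y[j_star] = 1.0
    return y
```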
Low-confidence Sample Annotating: After pseudo-labeling the high-confidence
samples in the self-paced manner, we employ AL to update the cur-
riculum constraint Ψ^λ in the model by supplementing the curriculum with more
information based on human knowledge. The AL process aims to select the most
low-confidence unlabeled samples and to annotate them as either positive or nega-
tive by requesting user annotation. Our selection criteria are based on the classical
uncertainty-based strategy [27, 28]. Specifically, given the current classifiers, we
randomly collect a number of unlabeled samples, which are usually located in low-
confidence areas near the classification boundaries.
(1) Annotated Sample Verifying: Considering that the user annotation may con-
tain outliers (incorrectly annotated samples), we introduce a verification step to
correct the wrongly annotated samples. Assuming that labeled samples with lower
prediction scores from the current classifiers have a higher probability of being incor-
rectly labeled, we propose asking the active user to verify the annotations of these
samples. Specifically, in this step, we first employ the current classifiers to obtain the
prediction scores of all the annotated samples. Then, we rerank them and select the
top-L ones with the lowest prediction scores and ask the user to verify the selected
samples, i.e., double-check them. We can set L as a small number (L = 5 in our
experiments) because we believe the chance of the human user making mistakes is
low. In sum, we improve the robustness of the AL process by further validating the
top-L most uncertain samples with the user. In this way, we can reduce the effects
of accumulated human annotation errors and enable the classifier to be trained in a
robust manner.
(2) Low-confidence Definition: When we utilize the current classifiers (m classi-
fiers for discriminating m object categories) to predict the labels of unlabeled samples,
the samples predicted as having more than two positive labels (i.e., assigned to several
person classes at once) are exactly those on which the current clas-
sifiers are ambiguous. We thus adopt them as so-called “low-confidence” samples and
require active users to manually annotate them. In this step, other “low-confidence”
criteria could also be utilized; we employ this simple strategy due to its intuitive ratio-
nality and efficiency.
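A minimal sketch of this selection rule is given below (not the authors' code); the threshold on the number of positive predictions is exposed as a parameter, since only the idea of multi-class ambiguity matters here.

```python
import numpy as np

def select_low_confidence(scores, min_positive=2):
    """scores: (n, m) decision values of n unlabeled samples under the m classifiers.
    Returns the indices of samples claimed by at least `min_positive` classifiers,
    which are then sent to the active user for annotation."""
    positives = (np.asarray(scores) > 0).sum(axis=1)
    return np.where(positives >= min_positive)[0]
```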
After users perform manual annotation, we update Ψ^λ by incorporating the
newly annotated sample set φ into the current curriculum Ψ^λ. For each annotated
sample, our AL process performs the following two operations: (i) set its curriculum
constraint, i.e., {Ψ_i^λ}_{i∈φ} = {1}, and (ii) update its labels {y_i}_{i∈φ} and add its index to
the set of currently annotated samples Ω^λ. Such a specified curriculum still complies
with the axiomatic conditions for the curriculum constraint as defined in [15]. For
the annotated samples, the corresponding Ψ_i^λ = {1} with expectation value 1 over
the whole set, while for the others, Ψ_i^λ = [0, 1] with expectation value 1/2. Thus, the
more informative samples still have a larger expectation than the others. Also, Ψ^λ
is clearly nonempty and convex. It, therefore, complies with traditional curriculum
understanding.
New Class Handling: After the AL process, if the active user annotates the
selected unlabeled samples with u unseen person classes, then the new classifiers for
these unseen classes need to be initialized without affecting the existing classifiers.
Moreover, there is another difficulty: the samples of the new classes are not
sufficient for classifier training. Owing to the proposed ASPL framework, we employ
the following steps to address the abovementioned issues.
(1) For each of the new class samples, search all the unlabeled samples and pick
out its K-nearest neighbors from the unknown sample set U in the feature space;
(2) Require the active user to annotate these selected neighbors to enrich the
positive samples for the new person classes; and
(3) Initialize and update {w^(j), b^(j), v^(j), y^(j), λ_j}_{j=m+1}^{m+u} for these new person
classes according to the abovementioned iteration process of {initialization, classifier
updating, high-confidence sample labeling, and low-confidence sample annotating}.

Algorithm 9.1: Sketch of the ASPL framework

Input: Input dataset {x_i}_{i=1}^n
Output: Model parameters w, b
1: Use the pretrained CNN to extract feature representations of {x_i}_{i=1}^n. Initialize multiple
   annotated samples into the curriculum Ψ^λ and the corresponding {y_i}_{i=1}^n and v. Set the
   initial pace parameters λ = {λ_j^0}_{j=1}^m.
while not converged do the following:
2: Update w, b by one-versus-all SVM
3: Update v by SPL via Eq. (9.7)
4: Pseudo-label high-confidence samples {y_i}_{i∈S} by reranking via Eq. (9.8)
5: Update the unknown sample set U
6: Verify the annotated samples by AL
7: Update the low-confidence samples {y_i, Ψ_i^λ}_{i∈φ} by AL
   if u unseen classes are labeled,
      Address the u new classes via the steps in Sect. 9.4
      Go to step 2
   end if
8: In every T iterations:
   • Update {x_i}_{i=1}^n through fine-tuning the CNN
   • Update λ according to Eq. (9.10)
9: end while
10: return w, b;
This step corresponds to the instructor’s role in human education, which aims to
guide a student to a more informative curriculum. In contrast to the previous fixed
curriculum setting in SPL throughout the learning process, here, the curriculum is
dynamically updated based on the self-paced learned knowledge of the model. Such
an improvement better simulates the general learning process of a good student.
As the learned knowledge of a student increases, his/her instructor should vary the
curriculum settings imposed on him/her from more in the early stage to less in the
later stage. This learning method evidently has a better learning effect that can be
adapted to the personal information of the student.
Feature Representation Updating: After several SPL and AL updat-
ing iterations of {w, b, {y_i}_{i=1}^n, v, Ψ^λ}, we aim to update the feature representation
{x_i}_{i=1}^n through fine-tuning the pretrained CNN by inputting all the manually labeled
samples from AL and the self-annotated samples from SPL. These samples tend to
deliver data knowledge to the network and improve the representation of the training
samples. Better feature representation is, therefore, expected to be extracted from
this fine-tuned CNN.
This learning process simulates the updating of the knowledge structure of a
human brain after a period of domain learning. Such updating tends to facilitate a
person’s ability to grasp more effective features to represent newly emerging samples
from certain domains and enables him/her to perform better as a learner. In our
experiments, we generally fine-tune the CNN after approximately 50 rounds of SPL
and AL updating, and the learning rate is set as 0.001 for all layers.
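A minimal PyTorch sketch of this fine-tuning step is shown below. It is not the authors' implementation; `cnn` is the pretrained network with a softmax classification head, and `loader` is assumed to yield batches of the AL-annotated and SPL pseudo-labeled samples.

```python
import torch
import torch.nn as nn

def finetune_cnn(cnn, loader, epochs=1, lr=1e-3):
    criterion = nn.CrossEntropyLoss()                         # softmax loss
    optimizer = torch.optim.SGD(cnn.parameters(), lr=lr, momentum=0.9)
    cnn.train()
    for _ in range(epochs):
        for images, labels in loader:                         # labeled + pseudo-labeled data
            optimizer.zero_grad()
            loss = criterion(cnn(images), labels)
            loss.backward()
            optimizer.step()
    return cnn
```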
Pace Parameter Updating: We utilize a heuristic strategy to update the pace
parameters {λ_j}_{j=1}^m of the m classifiers in our implementation.
After multiple iterations of the ASPL, we specifically set the pace parameter λ_j
for each individual classifier. For the tth iteration, we compute the pace parameter for
optimizing Eq. (9.2) by

$$\lambda_j^{t} = \begin{cases} \lambda^{0}, & t = 0 \\ \lambda_j^{(t-1)} + \alpha \cdot \eta_j^{t}, & 1 \le t \le \tau \\ \lambda_j^{(t-1)}, & t > \tau, \end{cases} \tag{9.10}$$

where η_j^t is the average accuracy of the jth classifier in the current iteration and α is
a parameter that controls the pace increase rate. In our experiments, we empirically
set {λ^0, α} = {0.2, 0.08}. Note that the pace parameters λ should stop increasing when
all training samples have v = {1}. Thus, we introduce an empirical threshold τ
so that λ is updated only in early iterations, i.e., t ≤ τ; τ is set to 12 in our
experiments.
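To make Eq. (9.10) concrete, a minimal sketch of the pace update using the empirical settings just mentioned is given below; it is not the authors' implementation, and `accuracy_j` stands for η_j^t.

```python
def update_pace(lam_prev, accuracy_j, t, lam0=0.2, alpha=0.08, tau=12):
    """Heuristic pace update of Eq. (9.10) for the jth classifier at iteration t."""
    if t == 0:
        return lam0
    if t <= tau:
        return lam_prev + alpha * accuracy_j   # grow the pace while the model matures
    return lam_prev                            # freeze the pace in later iterations
```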
The entire algorithm is summarized in Algorithm 9.1. It is easy to see
that this solving strategy for the ASPL model finely accords with the pipeline of our
framework.
Convergence Discussion: As illustrated in Algorithm 9.1, the ASPL algorithm alter-
nately updates the variables, including the classifier parameters w, b (by weighted
SVM), the pseudo-labels y (closed-form solution by Theorem 1), the weights v (by
SPL), and the low-confidence sample annotations φ (by AL). For the first three
parameters, the updates are calculated by a global optimum obtained from a sub-
problem of the original model; thus, the decrease of the objective function can be
guaranteed. However, similar to other existing AL techniques, human efforts are
involved in the loop of the AL stage; thus, the monotonic decrease of the objec-
tive function cannot be guaranteed in this step. As the learning proceeds, the model
tends to become increasingly mature, and the AL labor tends to lessen in the later
learning stage. Thus, with gradually less involvement of the AL calculation in our
algorithm, the monotonic decrease of the objective function over the iterations tends
to hold, and thus, our algorithm tends to be convergent.

References

1. L. Lin, K. Wang, D. Meng, W. Zuo, L. Zhang, Active self-paced learning for cost-effective
and progressive face identification. IEEE Trans. Pattern Anal. Mach. Intell. 40(1), 7–19 (2018)
2. F. Celli, E. Bruni, B. Lepri, Automatic personality and interaction style recognition from
facebook profile pictures, in ACM Conference on Multimedia (2014)
3. Z. Stone, T. Zickler, T. Darrell, Toward large-scale face recognition using social network
context. Proc. IEEE 98, (2010)
4. Z. Lei, D. Yi, S.Z. Li, Learning stacked image descriptor for face recognition. IEEE Trans.
Circuits Syst. Video Technol. PP(99), 1–1 (2015)
5. S. Liao, A.K. Jain, S.Z. Li, Partial face recognition: alignment-free approach. IEEE Transactions
on Pattern Analysis and Machine Intelligence 35(5), 1193–1205 (2013)
6. D. Yi, Z. Lei, S. Z. Li, Towards pose robust face recognition, in Computer Vision and Pattern
Recognition (CVPR), 2013 IEEE Conference on, pp. 3539–3545 (2013)
7. X. Zhu, Z. Lei, J. Yan, D. Yi, S.Z. Li, High-fidelity pose and expression normalization for face
recognition in the wild, in 2015 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 787–796 (2015)
8. Y. Sun, X. Wang, X. Tang, Hybrid deep learning for face verification, in Proceedings of IEEE
International Conference on Computer Vision (2013)
9. X. Wang, X. Guo, S. Z. Li, Adaptively unified semi-supervised dictionary learning with active
points, in 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1787–1795
(2015)
10. Y.-F. Li, Z.-H. Zhou, Towards making unlabeled data never hurt. IEEE Trans. Pattern
Anal. Mach. Intell. 37(1), 175–188 (2015)
11. H. Zhao et al., A novel incremental principal component analysis and its application for face
recognition (SMC, IEEE Transactions on, 2006)
12. T.-K. Kim, K.-Y. Kenneth Wong, B. Stenger, J. Kittler, R. Cipolla, Incremental linear discrimi-
nant analysis using sufficient spanning set approximations, in Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition (2007)
13. E. Ehsan, S. Guillermo, Y. Allen, S.S. Shankar, A convex optimization framework for active
learning, in Proceedings of IEEE International Conference on Computer Vision (2013)
14. K. Wang, D. Zhang, Y. Li, R. Zhang, L. Lin, Cost-effective active learning for deep image
classification. IEEE Trans. Circuits Syst. Video Technol. PP(99), 1–1 (2016)
15. L. Jiang, D. Meng, Q. Zhao, S. Shan, A.G. Hauptmann, Self-paced curriculum learning. Pro-
ceedings of AAAI Conference on Artificial Intelligence (2015)
16. L. Jiang, D. Meng, T. Mitamura, A.G. Hauptmann, Easy samples first: self-paced reranking
for zero-example multimedia search, in ACM Conference on Multimedia (2014)
17. L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, A. Hauptmann, Self-paced learning with diversity,
in Proceedings of Advances in Neural Information Processing Systems (2014)
18. Y. Bengio, J. Louradour, R. Collobert, J. Weston, Curriculum learning, in Proceedings of IEEE
International Conference on Machine Learning (2009)
19. M Pawan Kumar et al., Self-paced learning for latent variable models, in Proceedings of
Advances in Neural Information Processing Systems (2010)
20. G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S.Z. Li, T. Hospedales, When face recognition
meets with deep learning: an evaluation of convolutional neural networks for face recognition,
in The IEEE International Conference on Computer Vision (ICCV) Workshops (2015)
21. Y. LeCun, K. Kavukcuoglu, C. Farabet, Convolutional networks and applications in vision, in
ISCAS (2010)
22. K. Wang, L. Lin, W. Zuo, S. Gu, L. Zhang, Dictionary pair classifier driven convolutional
neural networks for object detection, in 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 2138–2146, June 2016
23. L.I. Smith, A tutorial on principal components analysis. Cornell University, USA 51, 52 (2002)
24. M. Karasuyama, I. Takeuchi, Multiple incremental decremental learning of support vector
machines, in Proceedings of Advances in Neural Information Processing Systems (2009)
25. N.-Y. Liang et al., A fast and accurate online sequential learning algorithm for feedforward
networks (Neural Networks, IEEE Transactions on, 2006)
26. S. Ozawa et al., Incremental learning of feature space and classifier for face recognition. Neural
Networks 18, (2005)
27. D.D. Lewis, W.A. Gale, A sequential algorithm for training text classifiers, in ACM SIGIR
Conference (1994)
28. S. Tong, D. Koller, Support vector machine active learning with applications to text classifica-
tion. J. Mach. Learn. Res. 2, (2002)
29. A.K. McCallumzy, K. Nigamy, Employing em and pool-based active learning for text classi-
fication, in Proceedings of IEEE International Conference on Machine Learning (1998)
30. A.J. Joshi, F. Porikli, N. Papanikolopoulos, Multi-class active learning for image classification,
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2009)
31. A. Kapoor, G. Hua, A. Akbarzadeh, S. Baker, Which faces to tag: Adding prior constraints into
active learning, in Proceedings of IEEE International Conference on Computer Vision (2009)
32. A. Kapoor, K. Grauman, R. Urtasun, T. Darrell, Active learning with gaussian processes for
object categorization, in Proceedings of IEEE International Conference on Computer Vision
(2007)
33. X. Li, Y. Guo, Adaptive active learning for image classification, in Proceedings of IEEE Con-
ference on Computer Vision and Pattern Recognition (2013)
34. K. Brinker, Incorporating diversity in active learning with support vector machines, in Pro-
ceedings of IEEE International Conference on Machine Learning (2003)
35. Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, A.G. Hauptmann, Self-paced learning for matrix
factorization, in Proceedings of AAAI Conference on Artificial Intelligence (2015)
36. M.P. Kumar, H. Turki, D. Preston, D. Koller, Learning specific-class segmentation from diverse
data, in Proceedings of IEEE International Conference on Computer Vision (2011)
37. Y.J. Lee, K. Grauman, Learning the easy things first: Self-paced visual category discovery, in
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2011)
38. J.S. Supancic, D. Ramanan, Self-paced learning for long-term tracking, in Proceedings of IEEE
Conference on Computer Vision and Pattern Recognition (2013)
39. S. Yu et al., Cmu-informedia@ trecvid 2014 multimedia event detection, in TRECVID Video
Retrieval Evaluation Workshop (2014)
40. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional
neural networks. Advances in Neural Information Processing Systems 25, 1097–1105 (2012)
41. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recogni-
tion, in ICLR (2015)
Part V
Higher Level Tasks

In computer vision, increasing attention has been paid to understanding human activ-
ity to determine what people are doing in a given video in different application
domains, e.g., intelligent surveillance, robotics, and human–computer interaction.
Recently developed 3D/depth sensors have opened new opportunities with enor-
mous commercial value by providing richer information (e.g., extra depth data of
scenes and objects) than traditional cameras. By building upon this enriched infor-
mation, human poses can be estimated more easily. However, modeling complicated
human activities remains challenging.
Many works on human action/activity recognition focus mainly on designing
robust and descriptive features [1, 2]. For example, Xia and Aggarwal [1] extract
spatiotemporal interest points from depth videos (DSTIP) and developed a depth
cuboid similarity feature (DCSF) to model human activities. Oreifej and Liu [2] pro-
pose capturing spatiotemporal changes in activities by using a histogram of oriented
4D surface normals (HON4D). Most of these methods, however, overlook detailed
spatiotemporal structural information and are limited to periodic activities.
Several compositional part-based approaches that have been studied for complex
scenarios have achieved substantial progress [3, 4]; they represent an activity with
deformable parts and contextual relations. For instance, Wang et al. [3] recognized
human activities in common videos by training the hidden conditional random fields
in a max-margin framework. For activity recognition in RGB-D data, Packer et al.
[5] employed the latent structural SVM to train the model with part-based pose
trajectories and object manipulations. An ensemble model of actionlets was studied
in [4] to represent 3D human activities with a new feature called the local occupancy
pattern (LOP). To address more complicated activities with large temporal variations,
some improved models discover the temporal structures of activities by localizing
sequential actions. For example, Wang and Wu [6] propose solving the temporal
alignment of actions by maximum margin temporal warping. Tang et al. [7] capture
the latent temporal structures of 2D activities based on the variable-duration hidden
Markov model. Koppula and Saxena [8] apply conditional random fields to model
the subactivities and affordances of the objects for 3D activity recognition.
In the depth video scenario, Packer et al. [5] address action recognition by mod-
eling both pose trajectories and object manipulations with a latent structural SVM.
Wang et al. [4] develop an actionlet ensemble model and a novel feature called the
local occupancy pattern (LOP) to capture intraclass variance in 3D action. However,
these methods address only short-duration action recognition, in which temporal
segmentation matters only slightly.
Recently, AND/OR graph representations have been introduced as extensions
of part-based models [9, 10] and produce very competitive performance to address
large data variations. These models incorporate not only hierarchical decompositions
but also explicit structural alternatives (e.g., different ways of composition). Zhu
and Mumford [9] first explore AND/OR graph models for image parsing. Pei et al.
[10] then introduce the models for video event understanding, but their approach
requires elaborate annotations. Liang et al. [11] propose training the spatiotemporal
AND/OR graph model using a nonconvex formulation, which is discriminatively
trained on weakly annotated training data. However, the abovementioned models
rely on handcrafted features, and their discriminative capacities are not optimized
for 3D human activity recognition.
The past few years have seen a resurgence of research on the design of deep neural
networks, and impressive progress has been made on learning image features from
raw data [12, 13]. To address human action recognition from videos, Ji et al. [14]
develop a novel deep architecture of convolutional neural networks to extract features
from both spatial and temporal dimensions. Luo et al. [15] propose incorporating
a new switchable restricted Boltzmann machine (SRBM) to explicitly model the
complex mixture of visual appearances for pedestrian detection; they train their
model using an EM-type iterative algorithm. Amer and Todorovic [16] apply sum-
product networks (SPNs) to model human activities based on variable primitive
actions.

References

1. L. Xia, J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity
recognition using depth camera, in CVPR, pp. 2834–2841 (2013)
2. O. Oreifej, Z. Liu, Hon4d: Histogram of oriented 4d normals for activity recog-
nition from depth sequences, in CVPR, pp. 716–723 (2013)
3. Y. Wang, G. Mori, Hidden part models for human action recognition: Probabilistic
vs. max-margin. IEEE Trans. Pattern Anal. Mach. Intell. 33(7), 1310–1323 (2011)
4. J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition
with depth cameras, in: CVPR, pp. 1290–1297 (2012)
5. B. Packer, K. Saenko, D. Koller, A combined pose, object, and feature model for
action understanding, in CVPR, pp. 1378–1385 (2012)
6. J. Wang, Y. Wu, Learning maximum margin temporal warping for action recog-
nition, in ICCV, pp. 2688–2695 (2013)
7. K. Tang, L. Fei-Fei, D. Koller, Learning latent temporal structure for complex
event detection, in CVPR, pp. 1250–1257 (2012)
8. H.S. Koppula, A. Saxena, Learning spatio-temporal structure from rgb-d videos
for human activity detection and anticipation, in ICML, pp. 792–800 (2013)
9. S. Zhu, D. Mumford, A stochastic grammar of images. Found. Trends Comput.
Graph. Vis. 2(4), 259–362 (2007)
10. M. Pei, Y. Jia, S. Zhu, Parsing video events with goal inference and intent pre-
diction, in ICCV, pp. 487–494 (2011)
11. X. Liang, L. Lin, L. Cao, Learning latent spatio-temporal compositional model
for human action recognition, in ACM Multimedia, pp. 263–272 (2013)
12. G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural
networks. Science 313 (5786), 504–507 (2006)
13. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep con-
volutional neural networks. Adv. Neural Inf. Process. Syst., 1097–1105 (2012)
14. S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
15. P. Luo, Y. Tian, X. Wang, X. Tang, Switchable deep network for pedestrian
detection, in CVPR (2014)
16. M.R. Amer, S. Todorovic, Sum-product networks for modeling activities with
stochastic structure, in: CVPR, pp. 1314–1321 (2012)
Chapter 10
Human Activity Understanding

Abstract Understanding human activity is very challenging even with recently
developed 3D/depth sensors. To solve this problem, this work investigates a novel
deep structured model that adaptively decomposes an activity into temporal parts
using convolutional neural networks (CNNs). The proposed model advances two
aspects of the traditional deep learning approaches. First, a latent temporal struc-
ture is introduced into the deep model, accounting for large temporal variations in
diverse human activities. In particular, we utilize the latent variables to decompose
the input activity into a number of temporally segmented subactivities and feed them
into the parts (i.e., subnetworks) of the deep architecture. Second, we incorporate a
radius-margin bound as a regularization term into our deep model, which effectively
improves the generalization performance for classification (Reprinted by permis-
sion from Springer Nature Customer Service Centre GmbH: Springer International
Journal of Computer Vision [1] © 2019).

10.1 Introduction

Most previous methods recognize 3D human activities by training discrimina-
tive/generative classifiers based on carefully designed features [2–5]. These
approaches often require sufficient domain knowledge and heavy feature engineer-
ing because of the difficulty (a), which could limit their application. To improve the
discriminative performance, certain compositional methods [6, 7] model complex
activities by segmenting videos into temporal segments of fixed length. However,
because of the difficulty of this task (b), they may have problems segmenting com-
plex activities composed of actions of diverse temporal durations, e.g., the examples
in Fig. 10.1.
In this work, we develop a deep structured human activity model to address the
abovementioned challenges and demonstrate superior performance in comparison
to other state-of-the-art approaches in the task of recognizing human activities from
grayscale-depth videos captured by an RGB-D camera (i.e., Microsoft Kinect). Our
model adaptively represents the input activity instance as a sequence of tempo-
rally separated subactivities, and each instance is associated with a cubic-like video


Fig. 10.1 Two activities of the same category. We consider one activity as a sequence of actions
that occur over time; the temporal composition of an action may differ for different subjects

segment of a flexible length. Our model is inspired by the effectiveness of two widely
successful techniques: deep learning [8–13] and latent structure models [14–18]. One
example of the former is the convolutional neural network (CNN), which was recently
applied to generate powerful features for video classification [13, 19]. On the other
hand, latent structure models (such as the deformable part-based model [15]) have
been demonstrated to be an effective class of models for managing large object vari-
ations for recognition and detection. One of the key components of these models is
the reconfigurable flexibility of the model structure, which is often implemented by
estimating latent variables during inference.
We adopt the deep CNN architecture [8, 13] to layer-wise extract features from
the input video data, and the architecture is vertically decomposed into several sub-
networks corresponding to the video segments, as Fig. 10.2 illustrates. In particular,
our model searches for the optimal composition of each activity instance during
recognition, which is the key to managing temporal variation in human activities.
Moreover, we introduce a relaxed radius-margin bound into our deep model, which
effectively improves the generalization performance for classification.

10.2 Deep Structured Model

In this section, we introduce the main components of our deep structured model,
including the spatiotemporal CNNs, the latent structure of activity decomposition,
and the radius-margin bound for classification.

10.2.1 Spatiotemporal CNNs

We propose an architecture of spatiotemporal convolutional neural networks (CNNs),


as Fig. 10.2 illustrates. In the input layer, the activity video is decomposed into
M video segments, with each segment associated with one separated subactivity.
Accordingly, the proposed architecture consists of M subnetworks to extract fea-
tures from the corresponding decomposed video segments. Our spatiotemporal CNNs
involve both 3D and 2D convolutional layers. The 3D convolutional layer extracts
spatiotemporal features to jointly capture appearance and motion information and is
followed by a max-pooling operator to improve the robustness against local deforma-
tions and noise. As shown in Fig. 10.2, each subnetwork (highlighted by the dashed
box) is two stacked 3D convolutional layers and one 2D convolutional layer. For
the input to each subnetwork, the number of frames is very small (e.g., 9). After
two layers of 3D convolution followed by max-pooling, the temporal dimension of
each set of feature maps is too small to perform a 3D convolution. Thus, we stack
a 2D convolutional layer upon the two 3D convolutional layers. The outputs from
the different subnetworks are merged to be fed into one fully connected layer that
generates the final feature vector of the input video.

10.2.2 Latent Temporal Structure

In contrast to traditional deep learning methods with fixed architectures, we incor-


porate the latent structure into the deep model to flexibly adapt to the input video during inference and learning.

Fig. 10.2 The architecture of spatiotemporal convolutional neural networks. The neural networks are stacked convolutional layers, max-pooling operators, and a fully connected layer, where the raw segmented videos are treated as the input. A subnetwork is referred to as a vertically decomposed subpart with several stacked layers that extracts features for one segmented video section (i.e., one subactivity). Moreover, by using the latent variables, our architecture is capable of explicitly handling diverse temporal compositions of complex activities

Fig. 10.3 Illustration of incorporating the latent structure into the deep model. Different subnetworks are denoted by different colors

Assume that the input video is temporally divided into
a number M of segments corresponding to the subactivities. We index each video
segment by its starting anchor frame s_j and its temporal length (i.e., the number of frames) t_j, where t_j is at least m, i.e., t_j ≥ m. To address the large temporal variation in human activities, we make s_j and t_j variables. Thus, for all video segments, we denote the indexes of the starting anchor frames as (s_1, ..., s_M) and their temporal lengths as (t_1, ..., t_M); these are regarded as the latent variables in our model, h = (s_1, ..., s_M, t_1, ..., t_M). These latent variables specifying the segmentation will
be adaptively estimated for different input videos.
We associate the CNNs with the video segmentation by feeding each segmented
part into a subnetwork, as Fig. 10.2 illustrates. Next, according to the method of
video segmentation (i.e., decomposition of subactivities), we manipulate the CNNs
by inputting the sampled video frames. Specifically, each subnetwork takes m video frames as input; if a segment contains more than m frames according to the latent variables (i.e., t_j > m), uniform sampling is performed to extract m key frames.
Figure 10.3 shows an intuitive example of our structured deep model in which the
input video is segmented into three sections corresponding to the three subnetworks
in our deep architecture. Thus, the configuration of the CNNs is dynamically adjusted
while searching for the appropriate latent variables of the input videos. Given the parameters of the CNNs ω and the input video x_i with its latent variables h_i, the generated feature of x_i can be represented as φ(x_i; ω, h_i).
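Conceptually, the latent variables simply select which frames each subnetwork sees. The following Python (PyTorch) sketch illustrates this coupling under simple assumptions; LatentVariables, subnetworks, and fc are hypothetical stand-ins for h and the CNN parameters ω, not the authors' code.

from dataclasses import dataclass
from typing import List
import torch

@dataclass
class LatentVariables:
    starts: List[int]    # s_j: starting anchor frame of segment j
    lengths: List[int]   # t_j: number of frames of segment j (t_j >= m)

def phi(video, h, subnetworks, fc, m=9):
    # video: (batch, channels, frames, height, width); h: LatentVariables
    parts = []
    for s, t, net in zip(h.starts, h.lengths, subnetworks):
        # uniformly sample m key frames from the segment [s, s + t - 1]
        idx = torch.linspace(s, s + t - 1, steps=m).round().long()
        parts.append(net(video[:, :, idx]))
    # merge the subnetwork outputs through the fully connected layer
    return fc(torch.cat(parts, dim=1))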

10.2.3 Deep Model with Relaxed Radius-Margin Bound

A large amount of training data is crucial for the success of many deep learning
models. Given sufficient training data, the effectiveness of applying the softmax
classifier to CNNs has been validated for image classification [20]. However, for 3D
human activity recognition, the available training data are usually less than expected.
For example, the CAD-120 dataset [21] consists of only 120 RGB-D sequences of 10
categories. In this scenario, although parameter pretraining and dropout are available,
the model training often suffers from overfitting. Hence, we consider introducing a
more effective classifier together with a regularizer to improve the generalization
performance of the deep model.
In supervised learning, the support vector machine (SVM), also known as the max-
margin classifier, is theoretically sound and generally can achieve promising perfor-
mance compared with the alternative linear classifiers. In deep learning research, the
combination of SVM and CNNs has been exploited [22] and has obtained excellent
results in object detection [23]. Motivated by these approaches, we impose a max-
margin classifier (w, b) upon the feature generated by the spatiotemporal CNNs for
human activity recognition.
As a max-margin classifier, the standard SVM adopts ‖w‖², the reciprocal of the squared margin γ², as the regularizer. However, the generalization error bound of SVM depends on the radius-margin ratio R²/γ², where R is the radius of the mini-
mum enclosing ball (MEB) of the training data [24]. When the feature space is fixed,
the radius R is constant and can, therefore, be ignored. However, in our approach, the
radius R is determined by the MEB of the training data in the feature space generated
by the CNNs. In this scenario, there is a risk that the margin can be increased by
simply expanding the MEB of the training data in the feature space. For example,
simply multiplying a constant to the feature vector can enlarge the margin between
the positive and negative samples, but obviously, this approach will not enable better
classification. To overcome this problem, we incorporate the radius-margin bound
into the feature learning, as Fig. 10.4 illustrates. In particular, we impose a max-
margin classifier with radius information upon the feature generated by the fully
connected layer of the spatiotemporal CNNs. The optimization tends to maximize
the margin while shrinking the MEB of the training data in the feature space, and we
thus obtain a tighter error bound.
Suppose there is a set of N training samples (X, Y) = {(x_1, y_1), ..., (x_N, y_N)}, where x_i is the video, y_i ∈ {1, ..., C} represents the category label, and C is the number of activity categories. We extract the feature for each x_i by the spatiotemporal CNNs, φ(x_i; ω, h_i), where h_i refers to the latent variables. By adopting the squared hinge loss and the radius-margin bound, we define the following loss function L_0 of
our model:

Fig. 10.4 Illustration of our deep model with the radius-margin bound. To improve the generaliza-
tion performance for classification, we propose integrating the radius-margin bound as a regularizer
with feature learning. Intuitively, as well as optimizing the max-margin parameters (w, b), we shrink
the radius R of the minimum enclosing ball (MEB) of the training data that are distributed in the
feature space generated by the CNNs. The resulting classifier with the regularizer shows better
generalization performance than the traditional softmax output

L_0 = \underbrace{\frac{1}{2}\|w\|^2 R_\phi^2}_{\text{radius-margin ratio}} + \lambda \sum_{i=1}^{N} \max\Bigl(0,\ 1 - \bigl(w^{T}\phi(x_i;\omega,h_i) + b\bigr)\, y_i\Bigr)^{2},   (10.1)

where λ is the trade-off parameter, 1/‖w‖ denotes the margin of the separating hyperplane, b denotes the bias, and R_φ denotes the radius of the MEB of the training data φ(X; ω, H) = {φ(x_1; ω, h_1), ..., φ(x_N; ω, h_N)} in the CNN feature space. Formally, the radius R_φ is defined as [24, 25],

R_\phi^2 = \min_{R,\,\phi_0} R^2, \quad \text{s.t. } \|\phi(x_i;\omega,h_i) - \phi_0\|^2 \le R^2,\ \forall i.   (10.2)

The radius Rφ is implicitly defined by both the training data and the model param-
eters, meaning (i) the model in Eq. (10.1) is highly nonconvex, (ii) the derivative of
Rφ with respect to ω is hard to compute, and (iii) the problem is difficult to solve using
the stochastic gradient descent (SGD) method. Motivated by the radius-margin-based
SVM [26, 27], we investigate using the relaxed form to replace the original definition
of Rφ in Eq. (10.2). In particular, we introduce the maximum pairwise distance R̃φ
over all the training samples in the feature space as

R̃φ2 = max φ(xi ; ω, h i ) − φ(x j ; ω, h j )2 . (10.3)


i, j

Do and Kalousis [26] proved that R_φ can be well bounded by R̃_φ, as stated in Lemma 2.

Lemma 2  \tilde{R}_\phi \le R_\phi \le \frac{1+\sqrt{3}}{2}\,\tilde{R}_\phi.
The abovementioned lemma guarantees that the true radius Rφ can be well approx-
imated by R̃φ . With the proper parameter η, the optimal solution for minimizing the
radius-margin ratio ‖w‖²R_φ² is the same as that for minimizing the radius-margin sum ‖w‖² + ηR_φ² [26]. Thus, by approximating R_φ² with R̃_φ² and replacing the radius-margin ratio with the radius-margin sum, we suggest the following deep model with the relaxed radius-margin bound:

L_1 = \frac{1}{2}\|w\|^2 + \eta \max_{i,j} \|\phi(x_i;\omega,h_i) - \phi(x_j;\omega,h_j)\|^2 + \lambda \sum_{i=1}^{N} \max\Bigl(0,\ 1 - \bigl(w^{T}\phi(x_i;\omega,h_i) + b\bigr)\, y_i\Bigr)^{2}.   (10.4)

However, the first max operator in Eq. (10.4) is defined over all training sample
pairs, and the minibatch-based SGD optimization method is, therefore, unsuitable.
Moreover, the radius in Eq. (10.4) is determined by the maximum distances of the
sample pairs in the CNN feature space, and it might be sensitive to outliers. To address
these issues, we approximate the max operator with a softmax function, resulting in
the following model:
L_2 = \frac{1}{2}\|w\|^2 + \eta \sum_{i,j} \kappa_{ij}\, \|\phi(x_i;\omega,h_i) - \phi(x_j;\omega,h_j)\|^2 + \lambda \sum_{i=1}^{N} \max\Bigl(0,\ 1 - \bigl(w^{T}\phi(x_i;\omega,h_i) + b\bigr)\, y_i\Bigr)^{2}   (10.5)

with

\kappa_{ij} = \frac{\exp\bigl(\alpha\, \|\phi(x_i;\omega,h_i) - \phi(x_j;\omega,h_j)\|^2\bigr)}{\sum_{i,j} \exp\bigl(\alpha\, \|\phi(x_i;\omega,h_i) - \phi(x_j;\omega,h_j)\|^2\bigr)},   (10.6)

where α ≥ 0 is the parameter used to control the degree of approximation of the


hard max operator. When α is infinite, the approximation in Eq. (10.5) becomes
the model in Eq. (10.4). Specifically, when α = 0, κ_{ij} = 1/N², and the relaxed loss
function can be reformulated as
L_3 = \frac{1}{2}\|w\|^2 + 2\eta \sum_{i} \|\phi(x_i;\omega,h_i) - \bar{\phi}_\omega\|^2 + \lambda \sum_{i=1}^{N} \max\Bigl(0,\ 1 - \bigl(w^{T}\phi(x_i;\omega,h_i) + b\bigr)\, y_i\Bigr)^{2}   (10.7)

with

\bar{\phi}_\omega = \frac{1}{N} \sum_{i} \phi(x_i;\omega,h_i).   (10.8)

The optimization objectives in Eqs. (10.5) and (10.7) are two relaxed losses of our deep model with the strict radius-margin bound in Eq. (10.1). The derivatives of the
relaxed losses with respect to ω are easy to compute, and the models can be readily
solved via SGD, which will be discussed in detail in Sect. 10.4.
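To make the relaxed objective concrete, the following sketch implements the α = 0 case, i.e., the loss L_3 of Eqs. (10.7)–(10.8), for a binary max-margin classifier; the binary labels in {−1, +1} and the hyperparameter values are simplifying assumptions, and the chapter's multi-class setting applies the same loss per category.

import torch

def relaxed_radius_margin_loss(features, labels, w, b, eta=0.1, lam=1.0):
    # features: (N, d), rows are phi(x_i; omega, h_i); labels: (N,) in {-1, +1}
    phi_bar = features.mean(dim=0, keepdim=True)              # Eq. (10.8)
    radius = 2.0 * eta * ((features - phi_bar) ** 2).sum()    # relaxed radius (MEB surrogate)
    scores = features @ w + b                                 # w: (d,), b: scalar
    hinge = torch.clamp(1.0 - labels * scores, min=0.0) ** 2  # squared hinge loss
    return 0.5 * (w ** 2).sum() + radius + lam * hinge.sum()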

10.3 Implementation

In this section, we first explain the implementation that makes our model adaptive
to an alterable temporal structure and then describe the detailed setting of our deep
architecture.

10.3.1 Latent Temporal Structure

During our learning and inference procedures, we search for the appropriate latent
variables that determine the temporal decomposition of the input video (i.e., the
decomposition of activities). There are two parameters relating to the latent vari-
ables in our model: the number M of video segments and the temporal length m of
each segment. Note that the subactivities decomposed by our model have no precise
definition in a complex activity, i.e., actions can be ambiguous depending on the
temporal scale being considered.
To incorporate the latent temporal structure, we associate the latent variables with
the neurons (i.e., convolutional responses) in the bottom layer of the spatiotemporal
CNNs.
The choice of the number of segments M is important for the performance of
3D human activity recognition. The model with a small M could be less expressive
in addressing temporal variations, while a large M could lead to overfitting due to
high complexity. Furthermore, when M = 1, the model latent structure is disabled,
and our architecture degenerates to the conventional 3D-CNNs [13]. By referring
to the setting of the number of parts for the deformable part-based model [15] in
object detection, the value M can be set by cross-validation on a small set. In all our
experiments, we set M = 4.
Considering that the number of frames of the input videos is diverse, we develop
a process to normalize the inputs by two-step sampling in the learning and inference
procedure. First, we sample 30 anchor frames uniformly from the input video. Based
on these anchor frames, we search for all possible nonoverlapping temporal segmen-
tations, and the anchor frame segmentation corresponds to the segmentation of the
input video. Then, from each video segment (indicating a subactivity), we uniformly

sample m frames to feed the neural networks, and in our experiments, we set m = 9.
In addition, we reject the segmentations in which any video segment cannot offer m frames.
For an input video, the number of admissible temporal structures (i.e., the number of valid anchor frame segmentations) is 115 in our experiments; a sketch of this enumeration and sampling procedure is given below.
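The enumeration and sampling can be sketched as follows. The exact rejection constraints (and hence the count of 115 quoted above) depend on implementation details that are not fully specified here, so the helper names and thresholds below are illustrative assumptions only.

from itertools import combinations
import numpy as np

def candidate_segmentations(num_frames, num_anchors=30, M=4, m=9):
    # 30 anchor frames sampled uniformly from the input video
    anchors = np.linspace(0, num_frames - 1, num_anchors).astype(int)
    segmentations = []
    # a segmentation = M contiguous, nonoverlapping groups of anchor frames
    for cuts in combinations(range(1, num_anchors), M - 1):
        bounds = [0, *cuts, num_anchors]
        parts = [(anchors[a], anchors[b - 1]) for a, b in zip(bounds[:-1], bounds[1:])]
        # reject segmentations whose video segments cannot offer m frames
        if all(end - start + 1 >= m for start, end in parts):
            segmentations.append(parts)
    return segmentations

def sample_key_frames(start, end, m=9):
    # uniformly sample m key frames from the segment [start, end]
    return np.linspace(start, end, m).round().astype(int)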

10.3.2 Architecture of Deep Neural Networks

The proposed spatiotemporal CNN architecture is constructed by stacking two 3D


convolutional layers, one 2D convolutional layer, and one fully connected layer, and
the max-pooling operator is deployed after each 3D convolutional layer. Below, we
introduce the definitions and implementations of these components of our model.
3D Convolutional Layer. The 3D convolutional operation is adopted to perform
convolutions spanning both spatial and temporal dimensions to characterize both
appearance and motion features [13]. Suppose p is the input video segment with
width w, height h, and m frames, and ω is a 3D convolutional kernel with width w′, height h′, and temporal length m′. As shown in Fig. 10.5, a feature map v can be obtained by performing 3D convolutions from the sth to the (s + m′ − 1)th frames, where the response at position (x, y, s) in the feature map is defined as

Fig. 10.5 Illustration of the 3D convolutions across both spatial and temporal domains. In this
example, the temporal dimension of the 3D kernel is 3, and each feature map is thus obtained by
performing 3D convolutions across 3 adjacent frames


v_{xys} = \tanh\Bigl(b + \sum_{i=0}^{w'-1} \sum_{j=0}^{h'-1} \sum_{k=0}^{m'-1} \omega_{ijk} \cdot p_{(x+i)(y+j)(s+k)}\Bigr),   (10.9)

where p_{(x+i)(y+j)(s+k)} denotes the pixel value of the input video p at position (x + i, y + j) in the (s + k)th frame, ω_{ijk} denotes the value of the convolutional kernel ω at position (i, j, k), b stands for the bias, and tanh denotes the hyperbolic tangent function. Thus, given p and ω, m − m′ + 1 feature maps can be obtained, each with a size of (w − w′ + 1, h − h′ + 1).
Based on the 3D convolutional operation, a 3D convolutional layer is designed
for spatiotemporal feature extraction by considering the following three issues:

• Number of convolutional kernels. The feature maps generated by one convolutional


kernel are limited in capturing appearance and motion information. To generate
more types of features, several kernels are employed in each convolutional layer.
We define the number of 3D convolutional kernels in the first layer as c1 . After the
first 3D convolutions, we obtain c1 sets of m − m + 1 feature maps. Then, we use
3D convolutional kernels on the c1 sets of feature maps and obtain c1 × c2 sets of
feature maps after the second 3D convolutional layer.
• Decompositional convolutional networks. Our deep model consists of M sub-
networks, and the input video segment for each subnetwork involves m frames
(the later frames might be unavailable). In the proposed architecture, all of the
subnetworks use the same structure, but each has its own convolutional kernels.
For example, the kernels belonging to the first subnetwork are deployed only to
perform convolutions on the first temporal video segment. Thus, each subnetwork
generates specific feature representations for one subactivity.
• Application to gray-depth video. The RGB images are first converted to gray-
level images, and the gray-depth video is then adopted as the input to the neural
networks. The 3D convolutional kernels in the first layer are applied to both the
gray channel and the depth channel in the video, and the convolutional results of
these two channels are further aggregated to produce the feature maps. Note that
the dimensions of the features remain the same as those from only one channel.

In our implementation, the input frame is scaled with height h = 80 and width
w = 60. In the first 3D convolutional layer, the number of 3D convolutional kernels
is c_1 = 7, and the size of each kernel is w′ × h′ × m′ = 9 × 7 × 3. In the second layer, the number of 3D convolutional kernels is c_2 = 5, and the size of each kernel is w′ × h′ × m′ = 7 × 7 × 3. Thus, we have 7 sets of feature maps after the first
3D convolutional layer and obtain 7 × 5 sets of feature maps after the second 3D
convolutional layer.
Max-pooling Operator. After each 3D convolution, the max-pooling operation is
introduced to enhance the deformation and shift invariance [20]. Given a feature map
with a size of a_1 × a_2, a d_1 × d_2 max-pooling operator is performed by taking the maximum of every nonoverlapping d_1 × d_2 subregion of the feature map, resulting in an a_1/d_1 × a_2/d_2 pooled feature map. In our implementation, a 3 × 3 max-pooling
operator was applied after every 3D convolutional layer. After two layers of 3D

convolutions and max-pooling, for each subnetwork, we have 7 × 5 sets of 6 × 4 × 5


feature maps.
2D Convolutional Layer. After two layers of 3D convolutions followed by max-
pooling, a 2D convolution is employed to further extract higher level complex fea-
tures. The 2D convolution can be viewed as a special case of 3D convolution with
m = 1, which is defined as

k −1 h −1
vx y = tanh(b + ωi j · p(x+i)(y+ j) ), (10.10)
i=0 j=0

where p_{(x+i)(y+j)} denotes the pixel value of the feature map p at position (x + i, y + j), ω_{ij} denotes the value of the convolutional kernel ω at position (i, j), and b denotes the bias. In the 2D convolutional layer, if the number of 2D convolutional kernels is c_3, then c_1 × c_2 × c_3 sets of new feature maps are obtained by performing 2D convolutions on the c_1 × c_2 sets of feature maps generated by the second 3D convolutional layer.
In our implementation, the number of 2D convolutional kernels is set as c_3 = 4
with a kernel size of 6 × 4. Hence, for each subnetwork, we can obtain 700 feature
maps with a size of 1 × 1.
Fully Connected Layer. There is only one fully connected layer with 64 neu-
rons in our architecture. All these neurons connect to a vector of 700 × 4 = 2800
dimensions, which is generated by concatenating the feature maps from all the sub-
networks. Because the training data are insufficient and a large number of parameters (i.e., 179,200) exist in this fully connected layer, we adopt the commonly used dropout trick with a rate of 0.6 to prevent overfitting. The margin-based classifier is defined on the output of the fully connected layer; it is trained with the squared hinge loss, and the activity category is predicted as

\theta(z) = \arg\max_{i} \bigl(w_i^{T} z + b_i\bigr),   (10.11)

where z is the 64-dimensional vector from the fully connected layer, and {w_i, b_i} denotes the weight and bias connected to the ith activity category.
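To make the layer arithmetic easier to follow, here is a minimal PyTorch sketch of a single subnetwork with the kernel numbers and sizes reported above. The grouped convolutions, the pooling rounding, and the lazily constructed 2D layer are implementation assumptions, so the intermediate map sizes may differ slightly from the 6 × 4 × 5 quoted in the text; this is an illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SubNetwork(nn.Module):
    # One subnetwork: two 3D convolution + max-pooling stages and one 2D convolution stage.
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv3d(2, 7, kernel_size=(3, 7, 9))             # c1 = 7 kernels of 9x7x3 over gray + depth
        self.conv2 = nn.Conv3d(7, 35, kernel_size=(3, 7, 7), groups=7)  # c2 = 5 kernels per set -> 35 maps
        self.conv2d = None                                              # built lazily once map sizes are known

    def forward(self, x):                      # x: (batch, 2, m=9, 80, 60)
        x = F.max_pool3d(torch.tanh(self.conv1(x)), (1, 3, 3), ceil_mode=True)
        x = F.max_pool3d(torch.tanh(self.conv2(x)), (1, 3, 3))
        b, c, t, h, w = x.shape                # the chapter reports 35 maps of roughly 6x4x5 at this point
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        if self.conv2d is None:                # c3 = 4 2D kernels sized to collapse each map to 1x1
            self.conv2d = nn.Conv2d(c, c * 4, kernel_size=(h, w), groups=c)
        x = torch.tanh(self.conv2d(x))
        return x.reshape(b, -1)

feature = SubNetwork()(torch.randn(1, 2, 9, 80, 60))
print(feature.shape)   # ~700 dims per subnetwork; concatenating M = 4 subnetworks gives ~2800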

10.4 Learning Algorithm

The proposed deep structured model involves three components to be optimized: (i)
the latent variables H that manipulate the activity decomposition, (ii) the margin-
based classifier {w, b}, and (iii) the CNN parameters ω. The latent variables are not
continuous and need to be estimated adaptively for different input videos, making
the standard backpropagation algorithm [8] unsuitable for our deep model. In this
section, we present a joint component learning algorithm that iteratively optimizes
the three components. Moreover, to overcome the problem of insufficient 3D data,

we propose to borrow a large number of 2D videos to pretrain the CNN parameters


in advance.

10.4.1 Joint Component Learning

Let (X, Y) = {(x_1, y_1), ..., (x_N, y_N)} denote the training set with N examples, where x_i is the video and y_i ∈ {1, ..., C} denotes the activity category. Denote by H = {h_1, ..., h_N} the set of latent variables for all training examples. The model
parameters to be optimized can be divided into three groups, i.e., H , {w, b}, and ω.
Fortunately, given any two groups of parameters, the other group of parameters can be
efficiently learned using either the stochastic gradient descent (SGD) algorithm (e.g.,
for {w, b} and ω) or enumeration (e.g., for H ). Thus, we conduct the joint component
learning algorithm by iteratively updating the three groups of parameters with three
steps: (i) Given the model parameters {w, b} and ω, we estimate the latent variables
h i for each video and update the corresponding feature φ(xi ; ω, h i ) (Fig. 10.6a);


Fig. 10.6 Illustration of our joint component learning algorithm, which iteratively performs three steps: a given the classification parameters {w, b} and the CNN parameters ω, we estimate the latent variables h_i for each video and generate the corresponding feature φ(x_i; ω, h_i); b given the updated features φ(X; ω, H) for all training examples, the classifier {w, b} is updated via SGD; and c given {w, b} and H, backpropagation updates the CNN parameters ω

(ii) given the updated features φ(X ; ω, H ), we adopt SGD to update the max-margin
classifier {w, b} (Fig. 10.6b); and (iii) given the model parameters {w, b} and H, we employ SGD to update the CNN parameters ω, which will lead to both an increase in the margin and a decrease in the radius (Fig. 10.6c). It is worth mentioning that
the two steps (ii) and (iii) can be performed in the same SGD procedure; i.e., their
parameters are jointly updated.
Below, we explain in detail the three steps for minimizing the losses in Eqs. (10.5)
and (10.7), which are derived from our deep model.
(i) Given the model parameters ω and {w, b}, for each sample (x_i, y_i), the most appropriate latent variables h_i can be determined by exhaustively searching over all possible choices,

h_i^* = \arg\min_{h_i} \Bigl(1 - \bigl(w^{T}\phi(x_i;\omega,h_i) + b\bigr)\, y_i\Bigr).   (10.12)

GPU programming is employed to accelerate the search process. With the updated
latent variables, we further obtain the feature set φ(X ; ω, H ) of all the training data.
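A minimal sketch of this exhaustive search is given below; phi is assumed to be a feature extractor that closes over the CNN parameters ω, and w, b are assumed to hold one weight vector and bias per class (all hypothetical names). In practice, the forward passes for different h are batched on the GPU.

def estimate_latent_variables(video, label, candidates, phi, w, b):
    # candidates: iterable of latent variables h; phi(video, h) returns a (1, d) feature
    best_h, best_value = None, float("inf")
    for h in candidates:                                  # exhaustive search, Eq. (10.12)
        score = float(phi(video, h) @ w[label] + b[label])
        value = 1.0 - score                               # hinge term for the true label
        if value < best_value:
            best_h, best_value = h, value
    return best_h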
(ii) Given φ(X ; ω, H ) and the CNN parameters ω, batch stochastic gradient
descent (SGD) is adopted to update the model parameters in Eqs. (10.5) and (10.7).
In iteration t, a batch B_t ⊂ (X, Y, H) of k samples is chosen. We can obtain the
gradients of the max-margin classifier with respect to parameters {w, b},

 

\frac{\partial L}{\partial w} = w - 2\lambda \sum_{(x_i,y_i,h_i)\in B_t} y_i\, \phi(x_i;\omega,h_i)\, \max\Bigl(0,\ 1 - \bigl(w^{T}\phi(x_i;\omega,h_i) + b\bigr)\, y_i\Bigr),   (10.13)

\frac{\partial L}{\partial b} = -2\lambda \sum_{(x_i,y_i,h_i)\in B_t} y_i\, \max\Bigl(0,\ 1 - \bigl(w^{T}\phi(x_i;\omega,h_i) + b\bigr)\, y_i\Bigr),   (10.14)

where L can be either the loss L_2 or the loss L_3.


(iii) Given the latent variables H and the max-margin classifier {w, b}, based
on the gradients with respect to ω, the backpropagation algorithm can be adopted
to learn the CNN parameters ω. To minimize L 2 in Eq. (10.5), we first update the
weights κi j in Eq. (10.6) based on φ(X ; ω, H ) and then introduce the variables κi
and φi ,

κi = κi j , (10.15)
j


φi = κi j φ(x j ; ω, h j ). (10.16)
j

With κ_i and φ_i, based on batch SGD, the derivative with respect to the spatiotemporal CNN parameters is

\frac{\partial L_2}{\partial \omega} = 4\eta \sum_{(x_i,y_i,h_i)\in B_t} \bigl(\kappa_i\, \phi(x_i;\omega,h_i) - \phi_i\bigr)^{T}\, \frac{\partial \phi(x_i;\omega,h_i)}{\partial \omega} - 2\lambda \sum_{(x_i,y_i,h_i)\in B_t} y_i\, w^{T}\, \frac{\partial \phi(x_i;\omega,h_i)}{\partial \omega}\, \max\Bigl(0,\ 1 - \bigl(w^{T}\phi(x_i;\omega,h_i) + b\bigr)\, y_i\Bigr).   (10.17)

When α = 0, we first update the mean φ̄_ω in Eq. (10.8) based on φ(X; ω, H) and then compute the derivative of the relaxed loss in Eq. (10.7) as

\frac{\partial L_3}{\partial \omega} = 4\eta \sum_{(x_i,y_i,h_i)\in B_t} \bigl(\phi(x_i;\omega,h_i) - \bar{\phi}_\omega\bigr)^{T}\, \frac{\partial \phi(x_i;\omega,h_i)}{\partial \omega} - 2\lambda \sum_{(x_i,y_i,h_i)\in B_t} y_i\, w^{T}\, \frac{\partial \phi(x_i;\omega,h_i)}{\partial \omega}\, \max\Bigl(0,\ 1 - \bigl(w^{T}\phi(x_i;\omega,h_i) + b\bigr)\, y_i\Bigr).   (10.18)

By performing the backpropagation algorithm, we can further decrease the relaxed


loss and optimize the model parameters. Note that during backpropagation, batch
SGD is adopted to update the parameters, and the update stops when it runs through
all the training samples once. The optimization algorithm iterates between these three
steps until convergence.

Algorithm 2: Learning Algorithm

Input:
The labeled 2D and 3D activity datasets and the learning rates α_{w,b}, α_ω.
Output:
Model parameters {ω, w, b}.
Initialization:
Pretrain the spatiotemporal CNNs using the 2D videos.
Learning on the 3D video dataset:
repeat
1. Estimate the latent variables H for all samples by fixing the model parameters {ω, w, b}.
2. Optimize {w, b} given the CNN parameters ω and the input sample segments indicated by H:
   2.1 Calculate φ(X; ω, H) by forwarding the neural network with ω.
   2.2 Optimize {w, b} via:
       w := w − α_{w,b} · ∂L/∂w by Eq. (10.13);
       b := b − α_{w,b} · ∂L/∂b by Eq. (10.14);
3. Optimize ω given {w, b} and H:
   3.1 Calculate κ_{ij}, κ_i, and φ_i for L_2, or calculate φ̄_ω for L_3.
   3.2 Optimize the parameters ω of the spatiotemporal CNNs:
       ω := ω − α_ω · ∂L/∂ω by Eq. (10.17) or (10.18).
until L in Eq. (10.5) or (10.7) converges.

10.4.2 Model Pretraining

Parameter pretraining followed by fine-tuning is an effective method of improving


performance in deep learning, especially when the training data are scarce. In the
literature, there are two popular solutions, i.e., unsupervised pretraining on unlabeled
data [28] and supervised pretraining for an auxiliary task [23]. The latter usually
requires that the data format (e.g., images) used for parameter pretraining be exactly the same as that used for fine-tuning.
In our approach, we suggest an alternative solution for 3D human activity recog-
nition. Although collecting RGB-D videos of human activities is expensive, a large
number of 2D activity videos can be easily obtained. Consequently, we first apply the
supervised pretraining using a large number of 2D activity videos and then fine-tune
the CNN parameters to train the 3D human activity models.
In the pretraining step, the CNN parameters are randomly initialized at the begin-
ning. We segment each input 2D video equally into M parts without estimating
its latent variables. Because the set of annotated 2D activity videos is large, we simply employ the softmax classifier with the CNNs and learn the parameters using the
backpropagation method.
The 3D and 2D convolutional kernels obtained in pretraining are only for the gray
channel. Thus, after pretraining, we duplicate the channel dimension of the first-layer 3D convolutional kernels and initialize the parameters of the depth channel with those of the gray channel, which allows us to borrow the features learned from
the 2D videos while directly learning the higher level information from the specific
3D activity dataset. For the fully connected layer, we set its parameters as random
values.
We summarize the overall learning procedure in Algorithm 2.
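The channel duplication can be sketched in PyTorch as follows; pretrained_conv1 and conv1 are illustrative module names rather than the authors' variables, and only the first-layer initialization is shown.

import torch
import torch.nn as nn

pretrained_conv1 = nn.Conv3d(1, 7, kernel_size=(3, 7, 9))   # first layer learned on gray 2D videos
conv1 = nn.Conv3d(2, 7, kernel_size=(3, 7, 9))              # target first layer: gray + depth

with torch.no_grad():
    gray = pretrained_conv1.weight                          # shape (7, 1, 3, 7, 9)
    conv1.weight.copy_(gray.repeat(1, 2, 1, 1, 1))          # depth channel starts from the gray kernels
    conv1.bias.copy_(pretrained_conv1.bias)
# the fully connected layer is re-initialized randomly before fine-tuning on the 3D data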

10.4.3 Inference

Given an input video x_i, the inference task aims to recognize the category of the activity, which can be formulated as the maximization of F_y(x_i, ω, h) with respect to the activity label y and the latent variables h,

(y^*, h^*) = \arg\max_{(y,h)} \bigl\{ F_y(x_i, \omega, h) = w_y^{T}\, \phi(x_i;\omega,h) + b_y \bigr\},   (10.19)

where {w_y, b_y} denotes the parameters of the max-margin classifier for the activity
category y. Note that the possible values for y and h are discrete. Thus, the problem
above can be solved by searching across all the labels y (1 ≤ y ≤ C) and calculating
the maximum Fy (xi , ω, h) by optimizing h. To find the maximum of Fy (xi , ω, h), we
enumerate all possible values of h and calculate the corresponding Fy (xi , ω, h) via

forward propagation. Because the forward propagations decided by different h are


independent, we can parallelize the computation via GPU to accelerate the inference
process.
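As a minimal illustration of Eq. (10.19), the sketch below enumerates the label/latent-variable pairs sequentially; phi, w, and b are the same hypothetical helpers as in the earlier sketches, and the GPU parallelization mentioned above is omitted for clarity.

def infer(video, candidates, phi, w, b, num_classes):
    # enumerate labels y and latent variables h; keep the highest-scoring pair
    best_y, best_h, best_score = None, None, float("-inf")
    for h in candidates:
        feat = phi(video, h)                       # one forward propagation per candidate h
        for y in range(num_classes):
            score = float(feat @ w[y] + b[y])      # F_y(x; omega, h) = w_y^T phi + b_y
            if score > best_score:
                best_y, best_h, best_score = y, h, score
    return best_y, best_h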

10.5 Experiments

To validate the advantages of our model, experiments are conducted on several chal-
lenging public datasets, i.e., the CAD-120 dataset [21], the SBU Kinect Interaction
dataset [29], and a larger dataset newly created by us, namely, the Office Activity
(OA) dataset. Moreover, we introduce a more comprehensive dataset in our experi-
ments by combining five existing datasets of RGB-D human activity. In addition to
demonstrating the superior performance of the proposed model compared to other
state-of-the-art methods, we extensively evaluate the main components of our frame-
work.

10.5.1 Datasets and Setting

The CAD-120 dataset comprises 120 RGB-D video sequences of humans performing
long daily activities in 10 categories and has been widely used to test 3D human
activity recognition methods. These activities recorded via the Microsoft Kinect
sensor were performed by four different subjects, and each activity was repeated
three times by the same actor. Each activity has a long sequence of subactivities,
which vary significantly from subject to subject in terms of length, order, and the
way the task is executed. The challenges of this dataset also lie in the large variance
in object appearance, human pose, and viewpoint. Several sampled frames and depth
maps from these 10 categories are exhibited in Fig. 10.7a.
The SBU dataset consists of 8 categories of two-person interaction activities,
including a total of approximately 300 RGB-D video sequences, i.e., approximately
40 sequences for each interaction category. Although most interactions in this dataset
are simple, it is still challenging to model two-person interactions by considering the
following difficulties: (i) one person is acting, and the other person is reacting in
most cases; (ii) the average frame length of these interactions is short (ranging from
20 to 40 s), and (iii) the depth maps have noise. Figure 10.7b shows several sampled
frames and depth maps of these 8 categories.
The proposed OA dataset is more comprehensive and challenging than the existing
datasets, and it covers regular daily activities that take place in an office. To the best
of our knowledge, it is the largest activity dataset of RGB-D videos, consisting of
1180 sequences. The OA database is publicly accessible.1 Three RGB-D sensors (i.e.,
Microsoft Kinect cameras) are utilized to capture data from different viewpoints, and

1 http://vision.sysu.edu.cn/projects/3d-activity/.

Fig. 10.7 Activity examples from the testing databases. Several sampled frames and depth maps are presented. a CAD-120, b SBU, c OA1, and d OA2 each show two activities of the same category selected from the corresponding database

more than 10 actors are involved. The activities are captured in two different offices
to increase variability, and each actor performs the same activity twice. Activities
performed by two subjects who interact are also included. Specifically, the dataset
is divided into two subsets, each of which contains 10 categories of activities: OA1
(complex activities by a single subject) and OA2 (complex interactions by two sub-
jects). Several sampled frames and depth maps from OA1 and OA2 are shown in
Fig. 10.7c, d, respectively.

10.5.2 Empirical Analysis

Empirical analysis is used to assess the main components of the proposed deep struc-
tured model, including the latent structure, relaxed radius-margin bound, model pre-
training, and depth/grayscale channel. Several variants of our method are suggested
by enabling/disabling certain components. Specifically, we denote the conventional
3D convolutional neural network with the softmax classifier as Softmax + CNN, the
3D CNN with the SVM classifier as SVM + CNN, and the 3D CNN with the relaxed
radius-margin bound classifier as R-SVM + CNN. Analogously, we refer to our deep

model as LCNN and then define Softmax + LCNN, SVM + LCNN, and R-SVM + LCNN accordingly.

Fig. 10.8 Test error rates with/without incorporating the latent structure into the deep model (test error rate versus training iteration; legend: Reconfigurable CNN vs. CNN). The solid curve represents the deep model trained by the proposed joint component learning method, and the dashed curve represents the traditional training method (i.e., using standard backpropagation)
Latent Model Structure. In this experiment, we implement a simplified version
of our model by removing the latent structure and comparing it with our full model.
The simplified model is actually a spatiotemporal CNN model with both 3D and
2D convolutional layers, and this model uniformly segments the input video into M
subactivities. Without the latent variables to be estimated, the standard backpropa-
gation algorithm is employed for model training. We execute this experiment on the
CAD120 dataset. Figure 10.8 shows the test error rates with different iterations of the
simplified model (i.e., R-SVM + CNN) and the full version (i.e., R-SVM + LCNN).
Based on the results, we observe that our full model outperforms the simplified
model in both error rate and training efficiency. Furthermore, the structured models
with model pretraining, i.e., Softmax + LCNN, SVM + LCNN, R-SVM + LCNN,
achieve 14.4%/11.1%/12.4% better performance than the traditional CNN models,
i.e., Softmax + CNN, SVM + CNN, R-SVM + CNN, respectively. The results clearly
demonstrate the significance of incorporating the latent temporal structure to address
the large temporal variations in human activities.
Pretraining. To justify the effectiveness of pretraining, we discard the parame-
ters trained on the 2D videos and learn the model directly on the grayscale-depth
data. Then, we compare the test error rate of the models with/without pretraining.
To analyze the rate of convergence, we adopt the R-SVM + LCNN framework and
use the same learning rate settings with and without pretraining for a fair
comparison. Using the CAD120 dataset, we plot the test error rates with increasing
iteration numbers during training in Fig. 10.9. The model using pretraining converges
in 170 iterations, while the model without pretraining requires 300 iterations, and the
model with pretraining converges to a much lower test error rate (9%) than that with-
out pretraining (25%). Furthermore, we also compare the performance with/without
pretraining using SVM + LCNN and R-SVM + LCNN. We find that pretraining is

effective in reducing the test error rate. Actually, the test error rate with pretraining is approximately 15% less than that without pretraining (Fig. 10.9).

Fig. 10.9 Test error rates with/without pretraining (test error rate versus training iteration)

Fig. 10.10 Confusion matrices of our proposed deep structured model on the a CAD120, b SBU, c OA1, and d OA2 datasets. It is evident that these confusion matrices all have a strong diagonal with few errors
Relaxed Radius-margin Bound. As described above, the training data for
grayscale-depth human activity recognition are scarce. Thus, for the last fully con-
nected layer, we adopt the SVM classifier by incorporating the relaxed radius-margin
bound, resulting in the R-SVM + LCNN model. To justify the role of the relaxed
radius-margin bound, Table 10.1 compares the accuracy of Softmax + LCNN, SVM
+ LCNN, and R-SVM + LCNN on all datasets with the same experimental settings.

Table 10.1 Average accuracy of all categories on the four datasets and the two merged sets with different classifiers

Dataset      Softmax + LCNN (%)   SVM + LCNN (%)   R-SVM + LCNN (%)
CAD120       82.7                 89.4             90.1
SBU          92.4                 92.8             94.0
OA1          60.7                 68.5             69.3
OA2          47.0                 53.7             54.5
Merged_50    30.3                 36.4             37.3
Merged_4     87.1                 88.5             88.9

Table 10.2 Channel analysis on the OA and merged datasets. Average accuracy of all categories is reported

Dataset      Grayscale (%)   Depth (%)   Grayscale + depth (%)
OA1          60.4            65.2        69.3
OA2          46.3            51.1        54.5
Merged_50    27.8            33.4        37.3
Merged_4     81.7            85.5        88.9

The max-margin-based classifiers (SVM and R-SVM) are particularly effective on


small-scale datasets (CAD120, SBU, OA1, OA2, Merged_50) (Fig. 10.10). On average, the accuracy of R-SVM + LCNN is 6.5% higher than that of Softmax + LCNN and approximately 1% higher than that of SVM + LCNN. On the Merged_4 dataset, the improvement of R-SVM + LCNN remains evident, as it is 1.8% higher than Softmax + LCNN. These results confirm our motivation to incorporate the radius-margin bound into our deep learning framework. Moreover,
when the model is learned without pretraining, R-SVM + LCNN gains about 4% and
8% accuracy improvement over Softmax + LCNN and SVM + LCNN, respectively.
Channel Analysis. To evaluate the contribution of the grayscale and depth data,
we execute the following experiment on the OA datasets: keeping only one data chan-
nel as input. Specifically, we first disable the depth channel and input the grayscale
data to perform the training/testing and then disable the grayscale channel and employ
the depth channel for training/testing. Table 10.2 shows that depth data can improve
the performance by large margins, especially on OA1 and Merged_50. The reason is
that large appearance variations exist in the grayscale data. In particular, our testing
is performed on new subjects, which further increases the appearance variations. In
contrast, the depth data are more reliable and have much smaller variations, which
is helpful in capturing salient motion information.

References

1. L. Lin, K. Wang, W. Zuo, M. Wang, J. Luo, L. Zhang, A deep structured model with radius-
margin bound for 3D human activity recognition. Int. J. Comput. Vis. 118(2), 256–273 (2016)
2. L. Xia, C. Chen, J.K. Aggarwal, View invariant human action recognition using histograms of
3d joints, in CVPRW, pp. 20–27 (2012)
3. O. Oreifej, Z. Liu, Hon4d: Histogram of oriented 4d normals for activity recognition from
depth sequences, in CVPR, pp. 716–723 (2013)
4. L. Xia, J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity recognition
using depth camera, in CVPR, pp. 2834–2841 (2013)
5. J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth
cameras, in CVPR, pp. 1290–1297 (2012)
6. Y. Wang, G. Mori, Hidden part models for human action recognition: Probabilistic vs. max-
margin. IEEE Trans. Pattern Anal. Mach. Intell. 33(7), 1310–1323 (2011)
7. J.M. Chaquet, E.J. Carmona, A. Fernandez-Caballero, A survey of video datasets for human
action and activity recognition. Comput. Vis. Image Underst. 117(6), 633–659 (2013)
8. Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, L.D. Jackel et al.,
Handwritten digit recognition with a back-propagation network. Adv. Neural Inf. Process. Syst. (1990)
9. G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks.
Science 313(5786), 504–507 (2006)
10. P. Wu, S. Hoi, H. Xia, P. Zhao, D. Wang, C. Miao, Online multimodal deep similarity learning
with application to image retrieval, in ACM Mutilmedia, pp. 153–162 (2013)
11. P. Luo, X. Wang, X. Tang, Pedestrian parsing via deep decompositional neural network, in
ICCV, pp. 2648–2655 (2013)
12. K. Wang, X. Wang, L. Lin, 3d human activity recognition with reconfigurable convolutional
neural networks, in ACM MM (2014)
13. S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition.
IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
14. S. Zhu, D. Mumford, A stochastic grammar of images. Found. Trends Comput. Graph. Vis.
2(4), 259–362 (2007)
15. P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discrim-
inatively trained part based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645
(2010)
16. M.R. Amer, S. Todorovic, Sum-product networks for modeling activities with stochastic struc-
ture, in CVPR, pp. 1314–1321 (2012)
17. L. Lin, X. Wang, W. Yang, J.H. Lai, Discriminatively trained and-or graph models for object
shape detection. IEEE Trans. Pattern Anal. Mach. Intell. 37(5), 959–972 (2015)
18. M. Pei, Y. Jia, S. Zhu, Parsing video events with goal inference and intent prediction, in ICCV,
pp. 487–494 (2011)
19. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video
classification with convolutional neural networks, in CVPR (2014)
20. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional
neural networks. Adv. Neural Inf. Process. Syst., 1097–1105 (2012)
21. H.S. Koppula, R. Gupta, A. Saxena, Learning human activities and object affordances from
rgb-d videos. Int. J. Robot. Res. (IJRR) 32(8), 951–970 (2013)
22. F.J. Huang, Y. LeCun, Large-scale learning with svm and convolutional for generic object
categorization, in CVPR, pp. 284–291 (2006)
23. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object
detection and semantic segmentation, in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (2014)
24. V. Vapnik, Statistical Learning Theory (John Wiley and Sons, New York, 1998)
25. O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for support
vector machines. Mach. Learn. 46(1–3), 131–159 (2002)

26. H. Do, A. Kalousis, Convex formulations of radius-margin based support vector machines, in
ICML (2013)
27. H. Do, A. Kalousis, M. Hilario, Feature weighting using margin and radius based error bound
optimization in SVMs, in Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol. 5781 (Springer, Berlin, Heidelberg, 2009), pp. 315–329
28. P. Sermanet, K. Kavukcuoglu, S. Chintala, Y. LeCun, Pedestrian detection with unsupervised multi-stage feature learning, in CVPR (2013)
29. K. Yun, J. Honorio, D. Chattopadhyay, T.L. Berg, D. Samaras, Two-person interaction detec-
tion using body-pose features and multiple instance learning, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2012)

Anda mungkin juga menyukai