
2016 IEEE Intelligent Vehicles Symposium (IV)

Gothenburg, Sweden, June 19-22, 2016

A Closer Look at Faster R-CNN for Vehicle Detection


Quanfu Fan, Lisa Brown and John Smith
IBM T. J. Watson Research Center, Yorktown Heights, NY 10532
qfan@us.ibm.com, lisabr@us.ibm.com, jsmith@us.ibm.com

Abstract— Faster R-CNN achieves state-of-the-art performance on generic object detection. However, a simple application of this method to a large vehicle dataset performs unimpressively. In this paper, we take a closer look at this approach as it applies to vehicle detection. We conduct a wide range of experiments and provide a comprehensive analysis of the underlying structure of this model. We show that through suitable parameter tuning and algorithmic modification, we can significantly improve the performance of Faster R-CNN on vehicle detection and achieve competitive results on the KITTI vehicle dataset. We believe our studies are instructive for other researchers investigating the application of Faster R-CNN to their problems and datasets.

Fig. 1. Examples from the KITTI car dataset. The data present large variations in appearance and camera viewpoint, and severe occlusions.
I. INTRODUCTION

Vehicle detection is of central significance to many applications such as public safety and security, surveillance, intelligent traffic control and autonomous driving. It is a challenging problem due to the large variations in appearance and camera viewpoint, and severe occlusions (see Fig. 1 for examples). Weather and lighting conditions are additional compounding issues. Previous work on vehicle detection has focused on special-purpose designs, e.g. hand-crafted features and part and occlusion modeling [20]. Although these methods perform reasonably well, the current top methods are all based on deep neural nets.

Recent years have seen an explosion of deep learning approaches in object detection. These approaches have leveraged significantly larger datasets than ever before and often achieve excellent results on datasets with a large number of classes or multiple tasks. Top results have been achieved for challenging datasets such as ImageNet and VOC, and for specialized datasets such as KITTI, which focuses on vehicle and pedestrian detection [1]. Among the top contenders in object detection is Faster R-CNN, proposed by Ren et al. [17].

However, it is not clear whether such an approach can perform as well as other top methods for specific applications such as vehicle detection. Indeed, our model trained with the default parameters of Faster R-CNN on the KITTI car dataset performed only moderately well compared to the top contenders in the KITTI car competition [1].

In this work, we conduct extensive experiments using Faster R-CNN on the KITTI dataset, including a study of both training and test scaling, object proposals, localization vs. recognition, and iterative training. We analyze the advantages of the approach as well as its limitations. Our studies show that the performance of Faster R-CNN depends strongly on the training and test scales. We also examine the localization and recognition capabilities of Faster R-CNN in great detail and use these results to design a new iterative approach that further improves performance on this dataset. Our main contributions are:
(1) a deeper understanding of how to tune and modify Faster R-CNN for specific applications and datasets;
(2) a significant improvement over the default performance of Faster R-CNN for vehicle detection on the KITTI dataset.

II. RELATED WORK

For many years, boosting techniques were used successfully for real-time vehicle detection [6], [5], and up until a few years ago the state of the art in vehicle detection was achieved by deformable part models [12], [16], [24], [9], [10]. A survey of vehicle detection from that time can be found in [20]. False positives are often filtered out using road and vehicle pose models combined with tracking information.

However, in recent years, deep models have proven to be more accurate for classification and detection across almost all object types. Using Convolutional Neural Networks (CNNs) has enabled a data-driven approach to detection, minimizing the work needed to design features and model objects, and the need to rely on additional sensors.

The KITTI dataset has provided a rich, practical and large dataset for autonomous driving applications that has ignited a burst of research applying CNNs to this problem. Over 20 competitive methods have been developed and evaluated on KITTI in the last two years, including approaches using location relaxation, dense neural patterns and regionlets, clustering appearance patterns, data-driven 3D voxel patterns, and integrating context and occlusion for car detection [26], [22], [15], [14], [23].

The best published result on the KITTI dataset is the fourth contender, X. Chen et al., NIPS 2015 [3]. Their work relies on stereo imagery to estimate 3D bounding boxes. The next best published work is DenseBox by L. Huang et al. [11]. DenseBox uses a single unified fully convolutional network that combines bounding box prediction and object classification into one framework. They augment their method with landmark information to improve performance. They achieve excellent results on the KITTI dataset and have continued to improve their performance. The main drawback is its computational cost. By using a very small model (around 1M in size), Huang et al. were able to build a lite version of DenseBox that can process one image in less than 0.05 seconds.
III. OVERVIEW OF FASTER R-CNN

Historically, object detection was performed by exhaustively deploying a two-class object classifier in a window-based search. All windows, across all viable scales and aspect ratios, that returned a positive object classification were then further pruned using non-maximum suppression. This method was later improved using various pre-filters called object proposal algorithms. Examples include minimizing search locations using Branch & Bound [13] and object size limits from calibration information [2], grouping super-pixels as in Selective Search [21], and pre-selecting windows based on an objectness criterion as in Spatial Pyramid Pooling [8] and Edge Boxes [25]. Pre-filtering has improved efficiency, in particular by sharing convolutions across proposals [1], but it is still a significant bottleneck in the computational cost of run-time detection. Faster R-CNN improves upon this methodology by using the features of a fully convolutional network to perform both region proposal and object detection.
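As a concrete reference for the pruning step just mentioned, here is a minimal greedy non-maximum suppression routine in the style commonly used with R-CNN-family detectors; the 0.3 IoU threshold is an illustrative default, not a value taken from this paper.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2]
    scores: (N,) detection confidences
    Returns the indices of the boxes to keep.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top-scoring box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        iou = w * h / (areas[i] + areas[order[1:]] - w * h)
        # Drop boxes that overlap the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep
```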
In Faster R-CNN, a region proposal network (RPN) shares convolutional layers with the object detection network, significantly reducing the proposal cost. A few additional convolutional layers are used to regress region bounds and objectness scores at each location. While this algorithmic design change has led to a significant speed-up, it has also proven to improve object detection performance. Faster R-CNN achieved the highest accuracy on both PASCAL VOC 2007 and 2012, and was the basis for the winning entries in ImageNet detection and localization at ILSVRC 2015 and in COCO detection at the COCO 2015 competition [18].

Figure 2 shows the network structure of the Faster R-CNN framework. Both the region proposal network and the object classifier share fully convolutional layers. These layers are effectively trained jointly. The region proposal network behaves as an attention director, determining the optimal bounding boxes across a wide range of scales and aspect ratios to be evaluated for object classification. In other words, the RPN tells the classifier where to look.

Fig. 2. The network structure of Faster R-CNN. The region proposal network (RPN) and the object classifier share fully convolutional layers, which are trained jointly. The RPN behaves as an attention director, determining the optimal bounding boxes across a wide range of scales and aspect ratios to be evaluated for object classification. In other words, the RPN tells the classifier where to look.
IV. KITTI DATASET

The KITTI object detection and object orientation estimation benchmark dataset [1] was collected using a fully calibrated autonomous driving platform and includes high-resolution stereo data, consecutive frames, visual odometry and 3D laser scans. For object detection, the benchmark consists of 12,000 images with 40,000 labeled objects, including cars, trucks, vans, pedestrians, cyclists and trams. Images are color and have a resolution of 1242x375. Each object is accurately labeled with a 3D bounding box. The benchmark provides a Matlab evaluation toolkit.

In our experiments, we use the subset of this dataset that includes cars. This consists of 7481 training images with 28,742 labeled cars (see Fig. 1). The distribution of the heights of these cars is shown in Fig. 3; the majority of cars have an image height of 40 to 80 pixels. The dataset is divided into three categories, Easy, Moderate and Hard, based on the minimum bounding box height (40, 25 and 25 pixels, respectively), maximum occlusion (fully visible, partly occluded, difficult to see) and truncation level (15%, 30% and 50%).

Fig. 3. Car distribution over height on the KITTI development dataset.

V. EXPERIMENTAL RESULTS

A. Experimental Setup

Data: We split the 7481 images in the KITTI development dataset into two parts, 2/3 for training and 1/3 for testing, which leads to 11042 valid training samples and 3105 test samples (Table I). The video sequences of the two subsets do not overlap. Table I also shows the distribution of samples over the three categories: easy, moderate and hard.

Model Parameters: Faster R-CNN scales all the samples in training to the same size based on the length of the shorter side of an image. In addition, the longer side of the image, if too large, is capped at a given size while the image's aspect ratio is maintained. In our case, since all the images are the same size, one parameter each for training and testing is sufficient to specify the image size. We refer to them as the training scale (TR_S) and the test scale (TE_S), respectively. Note that by default, Faster R-CNN sets the test scale equal to the training scale, i.e. 1000 pixels. Another relevant parameter we look at is the number of proposals used in classification, which has been shown to affect both the performance and the efficiency of Faster R-CNN.
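The resizing rule described above can be captured in a few lines. This is a sketch of our reading of that rule, with short_target and long_cap standing in for the shorter-side target and the longer-side cap; on KITTI a single value per phase (TR_S or TE_S) plays both roles.

```python
def resize_scale(height, width, short_target, long_cap):
    """Faster R-CNN-style resizing: bring the shorter image side to
    `short_target`, but never let the longer side exceed `long_cap`;
    the aspect ratio is always preserved."""
    scale = float(short_target) / min(height, width)
    if scale * max(height, width) > long_cap:
        scale = float(long_cap) / max(height, width)
    return scale

# On KITTI (1242x375) the longer-side cap is what binds, so one number
# per phase specifies the image size. With the default of 1000, images
# are in effect scaled DOWN by about 20%, as noted in Section V-B:
s = resize_scale(375, 1242, short_target=1000, long_cap=1000)
print(round(s, 3), round(1242 * s), round(375 * s))  # -> 0.805 1000 302
```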

Training: We fine-tuned Faster R-CNN models on top of VGG [19] pre-trained on ImageNet. Instead of training a binary classifier for cars only, we learned classifiers for all the object labels in the dataset, including people, but only consider cars in our analysis. We adopted average precision (AP) as our performance metric and calculated AP using the tool provided in [1].
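For reference, a generic interpolated average precision computation looks like the sketch below. The KITTI toolkit implements its own protocol (per-category evaluation, IoU 0.7 for cars), so this is only an illustration of the metric, not the evaluator itself.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Interpolated AP over a ranked detection list.

    scores:           (N,) confidences of all detections
    is_true_positive: (N,) 1 if the detection matches an unmatched
                      ground-truth box, else 0
    num_gt:           number of ground-truth objects
    """
    order = np.argsort(-scores)             # rank by confidence
    tp = np.cumsum(is_true_positive[order])
    fp = np.cumsum(1 - is_true_positive[order])
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # Sample precision at fixed recall points, taking the maximum
    # precision at any recall >= the sample point (interpolation).
    points = np.linspace(0, 1, 11)
    ap = 0.0
    for r in points:
        p = precision[recall >= r]
        ap += (p.max() if p.size else 0.0) / len(points)
    return ap
```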
TABLE I
TRAINING AND TEST DATA USED IN OUR EXPERIMENTS

Data    Resolution   Easy         Moderate     Hard         Total
Train   1242x375     4247 (38%)   3331 (30%)   3464 (31%)   11042
Test    1242x375     1724 (55%)   844 (27%)    537 (17%)    3105

Next, we take a deeper look at how these three parameters, TR_S, TE_S and the number of proposals, affect the performance of Faster R-CNN. Unless otherwise specified, each parameter is set to its default in our experiments.
B. What training scale is appropriate?

We first trained Faster R-CNN with its default settings, where the training scale is set to 1000 pixels. As shown in Table II, it did not perform well, achieving only 64.02% on the moderate car examples, while state-of-the-art results reported on the KITTI website are 90.03%. However, we believe this low performance can be explained. Faster R-CNN pools features for a proposal from its projected region on a very deep layer, conv-5, which in VGG has been down-sampled by a factor of 32 relative to the original input size. This leaves small objects little hope of receiving good features for classification. Using a training scale of 1000 is equivalent to further scaling the images down by 20% in training, which only makes the problem worse.

We further tested Faster R-CNN at other training scales, ranging from 800 to 1800. As expected, when the training scale increases, the performance of Faster R-CNN improves steadily on all three categories (Fig. 4), and the trend does not appear to stop at a scale of 1800. This implies that better performance might be achieved beyond a scale of 1800. However, we used a training scale of 1500 for most of the analysis below for efficiency considerations.

Fig. 4. Performance of Faster R-CNN under different training scales.
C. Does the test scale matter?

Scale invariance is a highly desirable property for object detection. A general practice to achieve scale invariance is to train and test a model at multiple scales. Previous work such as [8], [7] observed that single-scale detection performs almost as well as multi-scale detection. However, this finding has only been verified on the VOC dataset for generic object detection. We would like to find out whether the claim still holds on the KITTI benchmark for vehicle detection.

To avoid the prohibitively high computational cost imposed by a multi-scale setting, we train each model at one scale only, but test it at multiple scales. In our experiments, we built four models, at scales of 800, 1000, 1200 and 1500, and evaluated each of them at six different scales ranging from 800 to 2500. The maximum test scale (2500) is about twice the original size of a test image. We sketch this sweep below.
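A minimal sketch of the sweep, where load_model, detect and evaluate_ap are hypothetical placeholders for the actual training, detection and evaluation code:

```python
# Hypothetical helpers: load_model, detect and evaluate_ap stand in
# for the real Faster R-CNN training/evaluation pipeline.
TRAIN_SCALES = [800, 1000, 1200, 1500]
TEST_SCALES = [800, 1000, 1200, 1500, 2000, 2500]

results = {}
for tr_s in TRAIN_SCALES:
    model = load_model(train_scale=tr_s)        # one model per TR_S
    for te_s in TEST_SCALES:
        detections = detect(model, test_scale=te_s)
        # AP per difficulty category, as in the KITTI evaluation
        results[(tr_s, te_s)] = evaluate_ap(detections)
```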

From Fig. 5, we can clearly see that the performance of Faster R-CNN depends on the test scale for all the models, suggesting that scale invariance is not inherent in ConvNets but data dependent. In general, applying a test scale smaller than the training scale is not a good idea, as it always yields worse performance. On the other hand, a larger test scale, if chosen appropriately, can lead to performance improvement, especially for the moderate and hard categories. As explained earlier, the down-sampling effect in ConvNets prevents small objects from obtaining salient features; scaling up an image helps counter this issue, thus likely improving the results overall.

While it is theoretically hard to determine an optimal test scale, we can observe in Fig. 5 that the performance gain lasts longer for smaller training scales and disappears around a scale of 2000. We speculate that this relates to the size distribution of cars shown in Fig. 3. Since the aspect ratio of an image is maintained in both training and testing, this distribution is shift-invariant. When applying a large test scale such as 2000, if the training scale is not adjusted accordingly, the distribution of the training cars can drift too far from that of the test cars, which results in a performance drop.

Fig. 5. Performance of Faster R-CNN under different test scales. The dotted line in each plot marks the training scale TR_S used in the model: a) 800; b) 1000; c) 1200; d) 1500.

D. How many proposals are needed?

One big advantage of Faster R-CNN lies in its RPN, which provides high-quality proposals at a very small cost. The number of proposals used in [18] is 300, which produces the best performance on the VOC benchmark. We conducted a sensitivity analysis of this parameter at a training scale of 1500. As indicated by Fig. 6, with very few proposals (i.e. 25), Faster R-CNN performs comparably to a much larger number (i.e. 300), and even slightly better. We hypothesized that the majority of the top candidates from the RPN spatially correspond to car objects. This is confirmed by the IoU-to-recall plot (Fig. 7), which shows that high recall on the easy and moderate examples is achieved with just the top 50 proposals.
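The recall measurement behind Fig. 7 can be stated compactly. The sketch below assumes proposals are already sorted by RPN score; iou is a standard pairwise intersection-over-union helper.

```python
import numpy as np

def iou(a, b):
    """IoU between one box `a` (4,) and an (N, 4) array `b`,
    both in [x1, y1, x2, y2] format."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def recall_at_k(proposals_per_image, gt_per_image, k, thresh=0.5):
    """Fraction of ground-truth boxes covered by any of the top-k
    score-ranked proposals with IoU >= thresh."""
    hit, total = 0, 0
    for props, gts in zip(proposals_per_image, gt_per_image):
        top = props[:k]                 # proposals sorted by RPN score
        for g in gts:
            total += 1
            if top.size and iou(g, top).max() >= thresh:
                hit += 1
    return hit / max(total, 1)
```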
The low performance of the RPN on hard examples is understandable, as most of them exhibit severe occlusion. However, when more hits on hard cars occur as the number of proposals increases, we do not observe correspondingly better detection in Fig. 6. Further analysis reveals that this is related to the conversion power of a model, which we discuss in the next section.

Fig. 6. Faster R-CNN performance under different numbers of proposals.

Fig. 7. Recalls under different numbers of proposals when IoU = 0.5.

E. Localization vs. Recognition: which is better?

Faster R-CNN combines object localization (i.e. the regressor) and recognition (i.e. the classifier) into one network. In this section, we examine the object regressor and the classifier separately in order to better understand the limitations of the approach.

Let G = {g_i | i = 1...n} be the set of objects in an image. Also let r_i^0 be an RPN proposal and r_i its output from the regressor. We call r_i^0 a valid proposal (VP) if there exists at least one object g_j such that IoU(r_i^0, g_j) >= 0.5, where IoU(., .) is the intersection over union between two regions. We further call a valid proposal r_i^0 localizable (LP) if IoU(r_i, g_j) >= 0.7, where 0.7 is the threshold used in the KITTI evaluator. Similarly, r_i^0 is termed recognizable (RP) if s(r_i^0) >= 0.5, where s(.) is the classification score. Based on these notations, we calculate a localization rate lr = #LPs / #VPs and a recognition rate rr = #RPs / #VPs. Finally, we define a conversion rate cr = #(LPs ∩ RPs) / #VPs, which measures the overall ability of a model to convert a proposal into a correct detection.
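These rates translate directly into code. The sketch below computes them for one image, matching each proposal to its highest-IoU ground-truth object (one reasonable reading of the definitions) and reusing the iou helper defined above.

```python
def rates(proposals, regressed, scores, gts):
    """Localization (lr), recognition (rr) and conversion (cr) rates
    for one image, following the VP/LP/RP definitions above.
    proposals, regressed: (N, 4); scores: (N,); gts: (M, 4)."""
    vp = lp = rp = conv = 0
    for r0, r, s in zip(proposals, regressed, scores):
        overlaps = iou(r0, gts)               # iou helper from above
        j = overlaps.argmax()                 # best-matching object g_j
        if overlaps[j] < 0.5:                 # not a valid proposal
            continue
        vp += 1
        localizable = iou(r, gts)[j] >= 0.7   # KITTI threshold
        recognizable = s >= 0.5
        lp += localizable
        rp += recognizable
        conv += localizable and recognizable
    return lp / vp, rp / vp, conv / vp        # assumes vp > 0
```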

As expected, Faster R-CNN demonstrates great conversion capability on the easy examples, achieving rates of 85% to 95%, but it has difficulty handling hard examples (Fig. 8). While the localization rate varies with category, it tends to be less dependent on the number of proposals. Conversely, the recognition capability decreases in general as more proposals are used, suggesting that classification is the part that needs more improvement. In fact, localization and recognition are highly related in Faster R-CNN, as the classification relies on features extracted from the initial proposal, not from the new one produced by the regressor. It stands to reason that better localization leads to better detection. Based on this, we propose a new training scheme to improve localization and recognition in an iterative way.

Fig. 8. Localization, recognition and conversion rates under different numbers of proposals. TR_S=1500 and TE_S=1500.
TABLE II
OUR RESULTS AND STATE-OF-THE-ART RESULTS ON THE KITTI DATASET

Methods                     Easy    Moderate   Hard
Our default                 83.47   63.13      52.36
Our best                    95.14   83.73      71.22
KITTI best published [4]    93.04   88.64      79.10
KITTI best reported [1]     91.19   90.03      81.69

TABLE III
RUNNING TIME OF FASTER R-CNN MEASURED IN SECONDS (TR_S=1500)

Test Scale         1500                1800
#Proposals     100   200   300     100   200   300
Time (s)       0.32  0.35  0.36    0.45  0.47  0.47

F. Does Iterative Training Help?

Faster R-CNN is itself an iterative method that alternately refines localization and classification in two stages. We extended this idea with three additional refinement stages. Specifically, we obtain the RPN proposals from the model trained at the 2nd stage and feed them into Faster R-CNN again, adjusting both the convolutional layers and the fully connected (FC) layers (Stage 3). The refined network is then fixed and the RPN is re-trained (Stage 4). Finally, we repeat Stage 3 but re-tune only the FC layers (Stage 5). In this iterative way, we hope to improve the localization capability of the RPN, which in turn helps improve the classification of Faster R-CNN.
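In outline, the schedule looks as follows. The function names are hypothetical placeholders for the corresponding alternating-training steps, shown only to make the staging explicit.

```python
# Placeholder functions; this is a schematic of the 5-stage schedule
# described above, not runnable training code.

net = train_rpn_and_classifier()        # Stages 1-2: standard Faster R-CNN

# Stage 3: proposals from the Stage-2 model re-enter training,
# updating both the shared conv layers and the FC layers.
proposals = generate_proposals(net)
net = finetune(net, proposals, layers="conv+fc")

# Stage 4: freeze the refined network and re-train the RPN on top.
net = retrain_rpn(net, freeze_shared=True)

# Stage 5: repeat Stage 3, but re-tune only the FC layers.
proposals = generate_proposals(net)
net = finetune(net, proposals, layers="fc")
```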
Figure 9 compares the performance of Faster R-CNN (dotted lines) and our proposed iterative version of the approach (solid lines) at different test scales. The iterative training scheme leads to an improvement of 2%-3% on the easy and moderate examples, demonstrating the effectiveness of additional refinements in the training process. At a test scale of 1800, the model trained at a scale of 1500 yields the best performance among all our experiments: 95.14% (easy), 83.73% (moderate) and 71.22% (hard) (see Table II).

The best result on the KITTI benchmark is currently held by an anonymous submission called Meow; the best published result is held by the fourth-best competitor. We note that, because the test ground truth is not available to us, our results are evaluated on part of the training data. It is therefore not a precise comparison to other results. With this caveat in mind, our performance is better on the easy examples and worse on the moderate/hard examples.

Fig. 9. Performance comparisons between Faster R-CNN (dotted lines) and our proposed iterative version (solid lines). TR_S=1500.

G. Running Time

We benchmarked the model that produces our best results at two test scales (1500 and 1800) under different numbers of proposals, on a 32-core 3.1GHz server with 13GB of RAM and a Tesla K40 GPU card. Faster R-CNN runs at about 2 fps at an image size of 1800x543 (Table III). Also, the computational overhead imposed by the number of proposals is negligible compared to the cost of a large image size.

VI. CONCLUSION

We have conducted a wide range of experiments and provided a comprehensive analysis of the performance of Faster R-CNN on the task of vehicle detection, including an analysis of both the training and test scales, the number of proposals, localization vs. recognition, and iterative training. We have also shown how to tune the approach suitably to greatly improve performance on the challenging KITTI benchmark.

REFERENCES

[1] KITTI benchmark, http://www.cvlibs.net/datasets/kitti/eval_object.php.
[2] L. M. Brown, Q. Fan, and Y. Zhai. Self-calibration from vehicle information. In Advanced Video and Signal Based Surveillance (AVSS), 2015 12th IEEE International Conference on, pages 1–6. IEEE, 2015.
[3] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In NIPS, 2015.
[4] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 424–432. Curran Associates, Inc., 2015.
[5] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(8):1532–1545, 2014.
[6] R. Feris, R. Bobbitt, S. Pankanti, and M.-T. Sun. Efficient 24/7 object detection in surveillance videos. In Advanced Video and Signal Based Surveillance (AVSS), 2015 12th IEEE International Conference on, pages 1–6. IEEE, 2015.
[7] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Computer Vision–ECCV 2014, pages 346–361. Springer, 2014.
[9] M. Hejrati and D. Ramanan. Analyzing 3d objects in cluttered images. In Advances in Neural Information Processing Systems, pages 593–601, 2012.
[10] D. Held, J. Levinson, and S. Thrun. A probabilistic framework for car detection in images using context and scale. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 1628–1634. IEEE, 2012.
[11] L. Huang, Y. Yang, Y. Deng, and Y. Yu. Densebox: Unifying landmark localization with end to end object detection. CoRR, abs/1509.04874, 2015.
[12] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, R. Cheng-Yue, F. Mujica, A. Coates, et al. An empirical evaluation of deep learning on highway driving. arXiv preprint arXiv:1504.01716, 2015.

[13] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwin-
dow search: A branch and bound framework for object localization.
Pattern Analysis and Machine Intelligence, IEEE Transactions on,
31(12):2129–2142, 2009.
[14] B. Li, T. Wu, and S.-C. Zhu. Integrating context and occlusion for
car detection by hierarchical and-or model. In ECCV, 2014.
[15] C. Long, X. Wang, G. Hua, M. Yang, and Y. Lin. Accurate object
detection with location relaxation and regionlets relocalization. In
Asian Conference on Computer Vision, 2014.
[16] H. T. Niknejad, A. Takeuchi, S. Mita, and D. McAllester. On-
road multivehicle tracking using deformable object model and particle
filter with improved likelihood estimation. Intelligent Transportation
Systems, IEEE Transactions on, 13(2):748–758, 2012.
[17] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-
time object detection with region proposal networks. In C. Cortes,
N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances
in Neural Information Processing Systems 28, pages 91–99. Curran
Associates, Inc., 2015.
[18] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards
real-time object detection with region proposal networks. CoRR,
abs/1506.01497, 2015.
[19] K. Simonyan and A. Zisserman. Very deep convolutional networks
for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[20] S. Sivaraman and M. M. Trivedi. Looking at vehicles on the road:
A survey of vision-based vehicle detection, tracking, and behavior
analysis. Intelligent Transportation Systems, IEEE Transactions on,
14(4):1773–1795, 2013.
[21] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders.
Selective search for object recognition. International journal of
computer vision, 104(2):154–171, 2013.
[22] X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object detection. T-PAMI, 2015.
[23] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Data-driven 3d voxel
patterns for object category recognition. In IEEE Conference on
Computer Vision and Pattern Recognition, 2015.
[24] J. J. Yebes, L. M. Bergasa, R. Arroyo, and A. Lazaro. Supervised learning and evaluation of KITTI's cars detector with DPM. In Intelligent Vehicles Symposium Proceedings, 2014 IEEE, pages 768–773. IEEE, 2014.
[25] C. L. Zitnick and P. Dollár. Edge boxes: Locating object propos-
als from edges. In Computer Vision–ECCV 2014, pages 391–405.
Springer, 2014.
[26] W. Y. Zou, X. Wang, M. Sun, and Y. Lin. Generic object detection
with dense neural patterns and regionlets. In British Machine Vision
Conference, 2014.

