Data   Resolution   Easy         Moderate     Hard         Total
Train  1242x375     4247 (38%)   3331 (30%)   3464 (31%)   11042
Test                1724 (55%)   844 (27%)    537 (17%)    3105

TABLE I
TRAINING AND TEST DATA USED IN OUR EXPERIMENTS.
with a much larger number (i.e. 300), and even slightly better. We hypothesized that the majority of top candidates from RPN spatially correspond to car objects. This is confirmed by the IoU-to-recall plot (Fig. 7), which shows that high recalls on the easy and moderate examples are achieved with the top 50 proposals.

The low performance of RPN on hard examples is understandable, as most of them experience severe occlusions. However, when there are more hits on hard cars with the increase of proposals, we do not observe better detection as expected in Fig. 6. Further analysis reveals that this is related to the conversion power of a model, which we will discuss in the next section.

Fig. 5. Performance of Faster R-CNN under different test scales. The dotted line at each plot marks the training scale TR_S used in the model: a) 1000; b) 1000; c) 1200; d) 1500.

E. Localization vs. Recognition: Which Is Better?

Faster R-CNN combines object localization (i.e., the regressor) and recognition (i.e., the classifier) into one network. In this section, we examine the object regressor and classifier separately in order to better understand the limitations of the approach.

Let G = {g_i | i = 1...n} be a set of objects in an image. Also let r_i^0 be an RPN proposal and r_i be its output from the regressor. We call r_i^0 a valid proposal (VP) if there exists at least one object g_j such that IoU(r_i^0, g_j) >= 0.5, where IoU(., .) denotes the intersection over union between two regions. With that, we further call a valid proposal r_i^0 localizable (LP) if IoU(r_i, g_j) >= 0.7, where 0.7 is the threshold used in the KITTI evaluator. Similarly, r_i^0 is termed recognizable (RP) if s(r_i^0) >= 0.5, where s(.) is the classification score. Based on these notations, we calculate a localization rate lr = #LPs / #VPs and a recognition rate rr = #RPs / #VPs. Finally, we define a conversion rate cr = #(LPs ∩ RPs) / #VPs, which measures the overall ability of a model to turn a proposal into a correct detection.

As expected, Faster R-CNN demonstrates great conversion capability on the easy examples, achieving an accuracy of 85% to 95%, but it has difficulty handling hard examples. While the localization rate varies with category, it tends to be less dependent on the number of proposals. Conversely, the recognition capability decreases in general as more proposals are used, suggesting that classification is the part that needs more improvement. In fact, localization and recognition are highly related in Faster R-CNN, as the classification relies on features extracted from the initial proposal, not the new one produced by the regressor. It makes sense that better localization leads to better detection. Based on this, we propose a new training scheme to improve localization and recognition in an iterative way.
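To make the definitions concrete, the three rates can be computed directly from the RPN proposals, their regressed boxes, the classification scores, and the ground-truth boxes. The sketch below is our own illustration, not the authors' code; the function names and the (x1, y1, x2, y2) box format are assumptions:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def conversion_stats(proposals, regressed, scores, gt_boxes):
    """Return (lr, rr, cr) over the valid proposals of one image.

    proposals: raw RPN boxes r^0; regressed: regressor outputs r;
    scores: classification scores s(r^0); gt_boxes: ground-truth boxes.
    """
    n_vp = n_lp = n_rp = n_both = 0
    for r0, r, s in zip(proposals, regressed, scores):
        # Valid proposal (VP): r^0 overlaps some ground-truth object at IoU >= 0.5.
        if not any(iou(r0, g) >= 0.5 for g in gt_boxes):
            continue
        n_vp += 1
        lp = any(iou(r, g) >= 0.7 for g in gt_boxes)  # localizable: passes KITTI's 0.7 threshold
        rp = s >= 0.5                                  # recognizable: classifier accepts it
        n_lp += lp
        n_rp += rp
        n_both += lp and rp
    if n_vp == 0:
        return 0.0, 0.0, 0.0
    return n_lp / n_vp, n_rp / n_vp, n_both / n_vp
```

For example, a valid proposal whose regressed box reaches IoU 0.6 with its ground-truth car counts toward #VPs but not #LPs, which is exactly the failure mode the conversion rate is meant to expose.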
Methods                    Easy    Moderate  Hard
Our default                83.47   63.13     52.36
Our Best                   95.14   83.73     71.22
KITTI Best Published [4]   93.04   88.64     79.10
KITTI Best Reported [1]    91.19   90.03     81.69

TABLE II
OUR RESULTS AND STATE-OF-THE-ART RESULTS ON THE KITTI DATASET.

Test Scale       1500              1800
#Proposals   100   200   300   100   200   300
Time (s)     0.32  0.35  0.36  0.45  0.47  0.47

TABLE III
RUNNING TIME OF FASTER R-CNN MEASURED IN SECONDS (TR_S=1500).
F. Does Iterative Training Help?

Faster R-CNN itself is an iterative method that refines localization and classification alternately in two stages. We extended this idea by performing more refinements with 3 additional stages. Specifically, we obtain the RPN proposals using the model trained at the 2nd stage and feed them into Faster R-CNN again for adjusting both the convolutional layers and the fully connected (FC) layers (Stage 3). The refined network is then fixed and the RPN is re-trained (Stage 4). Finally, we repeat Stage 3, but re-tune only the FC layers (Stage 5). In such an iterative way, we hope to improve the localization capability of the RPN, which in turn helps improve the classification of Faster R-CNN.

Fig. 9. Performance comparisons between Faster R-CNN (dotted lines) and our proposed iterative version (solid lines). TR_S=1500.

Figure 9 compares the performance of Faster R-CNN (dotted lines) and our proposed iterative version of the approach (solid lines) at different test scales. The iterative training scheme leads to an improvement of 2%-3% on the easy and moderate examples, demonstrating the effectiveness of more refinements in the training process. At a test scale of 1800, the model trained at a scale of 1500 yields the best performance among all our experiments: 95.14% (easy), 83.73% (moderate) and 71.22% (hard) (see Table II).

The best result on the KITTI benchmark is currently held by an anonymous submission called Meow. The best published result is held by the fourth best competitor. We note that, because the test ground truth is not available to us, our results are evaluated on part of the training data. It is therefore not a precise comparison to other results. With this caveat in mind, our performance is better on the easy examples and worse on the moderate/hard examples.

G. Running Time

We benchmarked the model that produces our best results at two test scales (1500 and 1800) under different numbers of proposals, on a 32-core 3.1GHz server with 13G RAM and a Tesla K40 GPU card. Faster R-CNN runs at about 2 fps at an image size of 1800x543 (Table III). Moreover, the computational overhead imposed by the number of proposals is negligible compared to the cost of a large image size.

VI. CONCLUSION

We have conducted a wide range of experiments and provided a comprehensive analysis of the performance of Faster R-CNN on the task of vehicle detection, including an analysis of both training and test scale size, the number of proposals, localization vs. recognition, and iterative training. We have also shown how to tune the approach suitably to greatly improve performance on the challenging KITTI benchmark.

REFERENCES

[1] KITTI benchmark, http://www.cvlibs.net/datasets/kitti/eval object.php.
[2] L. M. Brown, Q. Fan, and Y. Zhai. Self-calibration from vehicle information. In Advanced Video and Signal Based Surveillance (AVSS), 2015 12th IEEE International Conference on, pages 1–6. IEEE, 2015.
[3] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In NIPS, 2015.
[4] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 424–432. Curran Associates, Inc., 2015.
[5] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(8):1532–1545, 2014.
[6] R. Feris, R. Bobbitt, S. Pankanti, and M.-T. Sun. Efficient 24/7 object detection in surveillance videos. In Advanced Video and Signal Based Surveillance (AVSS), 2015 12th IEEE International Conference on, pages 1–6. IEEE, 2015.
[7] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In Computer Vision–ECCV 2014, pages 346–361. Springer, 2014.
[9] M. Hejrati and D. Ramanan. Analyzing 3d objects in cluttered images. In Advances in Neural Information Processing Systems, pages 593–601, 2012.
[10] D. Held, J. Levinson, and S. Thrun. A probabilistic framework for car detection in images using context and scale. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 1628–1634. IEEE, 2012.
[11] L. Huang, Y. Yang, Y. Deng, and Y. Yu. Densebox: Unifying landmark localization with end to end object detection. CoRR, abs/1509.04874, 2015.
[12] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, R. Cheng-Yue, F. Mujica, A. Coates, et al. An empirical evaluation of deep learning on highway driving. arXiv preprint arXiv:1504.01716, 2015.
[13] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficient subwin-
dow search: A branch and bound framework for object localization.
Pattern Analysis and Machine Intelligence, IEEE Transactions on,
31(12):2129–2142, 2009.
[14] B. Li, T. Wu, and S.-C. Zhu. Integrating context and occlusion for
car detection by hierarchical and-or model. In ECCV, 2014.
[15] C. Long, X. Wang, G. Hua, M. Yang, and Y. Lin. Accurate object
detection with location relaxation and regionlets relocalization. In
Asian Conference on Computer Vision, 2014.
[16] H. T. Niknejad, A. Takeuchi, S. Mita, and D. McAllester. On-
road multivehicle tracking using deformable object model and particle
filter with improved likelihood estimation. Intelligent Transportation
Systems, IEEE Transactions on, 13(2):748–758, 2012.
[17] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-
time object detection with region proposal networks. In C. Cortes,
N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances
in Neural Information Processing Systems 28, pages 91–99. Curran
Associates, Inc., 2015.
[18] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards
real-time object detection with region proposal networks. CoRR,
abs/1506.01497, 2015.
[19] K. Simonyan and A. Zisserman. Very deep convolutional networks
for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[20] S. Sivaraman and M. M. Trivedi. Looking at vehicles on the road:
A survey of vision-based vehicle detection, tracking, and behavior
analysis. Intelligent Transportation Systems, IEEE Transactions on,
14(4):1773–1795, 2013.
[21] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders.
Selective search for object recognition. International journal of
computer vision, 104(2):154–171, 2013.
[22] X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object
detection. In T-PAMI, 2015.
[23] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Data-driven 3d voxel
patterns for object category recognition. In IEEE Conference on
Computer Vision and Pattern Recognition, 2015.
[24] J. J. Yebes, L. M. Bergasa, R. Arroyo, and A. Lazaro. Supervised
learning and evaluation of kitti’s cars detector with dpm. In Intelligent
Vehicles Symposium Proceedings, 2014 IEEE, pages 768–773. IEEE,
2014.
[25] C. L. Zitnick and P. Dollár. Edge boxes: Locating object propos-
als from edges. In Computer Vision–ECCV 2014, pages 391–405.
Springer, 2014.
[26] W. Y. Zou, X. Wang, M. Sun, and Y. Lin. Generic object detection
with dense neural patterns and regionlets. In British Machine Vision
Conference, 2014.