Abstract—On-road obstacle detection and classification is one of the key tasks in the perception system of self-driving vehicles. Since vehicle tracking involves localization and association of vehicles between frames, detection and classification of vehicles is necessary. Vision-based approaches are popular for this task due to their cost-effectiveness and the usefulness of the appearance information associated with the vision data. In this paper, a deep learning system using a region-based convolutional neural network trained with the PASCAL VOC image dataset is developed for the detection and classification of on-road obstacles such as vehicles, pedestrians and animals. The implementation of the system on a Titan X GPU achieves a processing frame rate of at least 10 fps for a VGA resolution image frame. This sufficiently high frame rate using a powerful GPU demonstrates the suitability of the system for highway driving of autonomous cars. The detection and classification results on images from KITTI and iRoads, and also from Indian roads, show that the performance of the system is invariant to object shape and view, and to different lighting and climatic conditions.

Keywords—Autonomous driving, object detection, object classification, deep learning, convolutional neural network, R-CNN

I. INTRODUCTION

Collision avoidance is a key component in self-driving vehicles, and obstacle detection is one of the main tasks of this system. The best-known approach to obstacle detection uses active sensors like lidars, lasers and millimeter-wave radars. Their main advantage is that they can measure distance directly using limited computing resources. However, these active sensors do have many drawbacks, like slow scanning speed and low spatial resolution. Moreover, interference among sensors of the same type creates a serious problem when a number of vehicles move closely together in the same direction simultaneously. Optical sensors, like conventional cameras, collect data in a non-intrusive way and are generally referred to as passive sensors. Cost is one of the major advantages for preferring passive sensors over active sensors. Moreover, visual information plays a key role in several applications, like object identification, traffic sign recognition and lane detection. On the other hand, due to several variabilities within the classes, obstacle detection is highly challenging when designed and deployed with optical sensors. Environmental conditions add another challenge for acquiring high-quality images.

Existing techniques for vision-based on-road obstacle detection [1], [2] have not progressed to their mature form due to many issues such as variability in vehicle shapes, cluttered environments and illumination conditions. Deep learning [3] has shown great promise in recent years in the field of object detection and recognition. Convolutional Neural Networks (CNN) are dedicated to vision-based approaches and are well suited to Graphics Processing Unit (GPU) acceleration in real-time applications. GPUs, originally designed for 3D modeling and rendering, now solve classic image processing problems and provide tremendous improvements in speed over CPU-only implementations. GPUs deployed in the perception system of autonomous vehicles could process video frames at a sufficiently high frame rate and facilitate high-speed driving by detecting obstacles well in advance for motion planning to avoid collision.

In this paper, we address the detection and classification of on-road objects using the Faster Region-based CNN (R-CNN), a variant of the CNN, and its implementation on a GPU. We employ a pre-trained network model, ZF Net, fine-tuned for the 20 object classes of the PASCAL VOC 2012 dataset [4], in our detection and classification system. During the real-time detection phase, the detections are filtered such that the entire system detects only the classes which correspond to on-road objects. The outputs of the system are the rectangular bounding boxes and the class information of the detected objects, which are useful parameters for motion planning of the self-driving vehicle.

This paper is organized as follows. The next section briefly describes the conventional CNN and its variant, the R-CNN. Section III explains the on-road obstacle detection and classification system. The GPU implementation of the system using the Caffe framework, along with the results and performance, is given in Section IV. Section V concludes the paper.
(Figure: R-CNN classification stage — CNN features are scored by per-class SVMs, e.g. "Person? No", "Car? Yes", "Dog? No")
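The class filtering described above, in which detections for non-road classes are masked so that only on-road objects are reported, can be sketched as follows. This is an illustrative sketch, not the authors' code: the class list, score threshold and detection tuple format are assumptions.

```python
# Illustrative sketch of the post-detection filtering step: the detector
# returns boxes for all 20 PASCAL VOC classes, and a simple filter keeps
# only the on-road classes above a confidence threshold.

# Hypothetical on-road subset of the PASCAL VOC 2012 classes.
ON_ROAD_CLASSES = {"person", "car", "bus", "motorbike", "bicycle",
                   "dog", "cow", "horse", "sheep"}

def filter_detections(detections, score_threshold=0.7):
    """Keep detections whose class is on-road and whose score is high enough.

    `detections` is assumed to be a list of
    (class_name, score, (x1, y1, x2, y2)) tuples.
    """
    return [(cls, score, box)
            for cls, score, box in detections
            if cls in ON_ROAD_CLASSES and score >= score_threshold]

# Example: raw detector output for one frame (made-up values).
raw = [("car", 0.95, (120, 200, 360, 330)),
       ("sofa", 0.81, (10, 20, 80, 90)),       # non-road class, masked
       ("person", 0.88, (400, 180, 450, 300)),
       ("car", 0.40, (0, 0, 30, 30))]          # low score, dropped

kept = filter_detections(raw)
print([cls for cls, _, _ in kept])  # ['car', 'person']
```

The surviving boxes would then be drawn on the frame with the per-class colors of Table I and passed to the tracking module.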
training was performed, as this is faster than the alternating training. Since the training set contains a lot of non-road objects, the network would require retraining with only the on-road object classes. We have, however, not retrained the network. Instead, the non-road object detections are masked efficiently in the detection phase.

B. Function of the System

For real-time detection on video, each image frame is fed to the system. The image frame is processed by the trained R-CNN module to obtain the bounding boxes of the various class-specific objects. Since the system is designed for detecting only on-road objects, some classes are masked in such a way that only the on-road objects are recognized. This is done by a filter. The processed image is then annotated with bounding boxes tagged with the respective class name on top of each detected object. The bounding box colors are listed in Table I. These bounding boxes are fed to a tracking module for motion planning of an autonomous vehicle.

TABLE I. LIST OF ON-ROAD OBSTACLES AND THEIR RESPECTIVE COLORS OF THE BOUNDING BOXES

Class name | Color of the Bounding Box
Bicycle    | Red

IV. GPU IMPLEMENTATION AND RESULTS

The obstacle detection and classification system was implemented on an Ubuntu workstation with an NVIDIA GeForce GTX 980 Ti GPU. The GPU has 6 GB of graphics memory and 2816 CUDA cores. The workstation is powered by an Intel i7-6700 with 16 GB of RAM. There are two modules in the proposed system. The main module runs on the CPU. The second module, which includes the Caffe framework of the R-CNN, runs on the GPU. This framework has a C++ library with MATLAB and Python bindings used for training general-purpose convolutional neural networks and other deep networks, and hence for deploying them efficiently on commodity architectures.

A. Detection Results

The implementation was tested on a variety of datasets in different climatic conditions. Images were taken from public datasets such as KITTI [8] (size: 1392x512) and iRoads [9] (size: 640x360). Video frames from shots taken on a Bangalore road (size: 1920x1080) and a Chennai road (size: 1920x1080) from a camera on board a vehicle were also considered. Apart from these, on-road animal images (sizes: 1025x680, 1001x608) were also tested. Fig. 3 to Fig. 6 show the performance of the ZF Net model of the Faster R-CNN on the GPU. The objects in Fig. 3 were detected in 48 ms for 300 object proposals. The objects in Fig. 4 were detected in 70 ms for 163 object proposals. The objects in Figs. 5 and 6 were detected in around 90 ms and 60 ms respectively for 300 object proposals. The results show the robustness of the approach to different views of objects as well as to the lighting conditions. The detection time is less than 100 ms for an image of considerable size.

The detection time was also computed for images of various standard display resolutions. The bar chart in Fig. 7 displays these results. It can be seen that most of the resolutions can be processed at a frame rate of 10 fps.

The detection accuracy is evaluated using the average precision per class and the mean average precision over classes:

average precision (AP) = (Σ P ∀ True Positives) / (Number of True Positives)

mean average precision (mAP) = (Σ AP ∀ Classes) / (Number of Classes)

A detected object is a true positive only if the Intersection over Union (IoU) of the ground-truth and detected bounding boxes is ≥ 0.5. Table II shows the mAP calculation for the Kitti_drive0005 video shot containing 153 video frames. Tables III and IV show the mAP calculations for the Chennai road dataset of 50 images and the Bangalore road dataset of 100 images respectively.
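The IoU criterion and the AP/mAP formulas can be checked with a small worked example. The numbers below are illustrative only and are not taken from the paper's tables.

```python
# Worked example of the IoU >= 0.5 true-positive test and the
# AP / mAP formulas used for evaluation (illustrative numbers).

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# A detection counts as a true positive only when IoU >= 0.5.
gt  = (100, 100, 200, 200)
det = (110, 110, 210, 210)   # overlap area 90x90 -> IoU ~ 0.68
print(iou(gt, det) >= 0.5)   # True

def average_precision(precisions_at_true_positives):
    """AP = sum of precisions at each true positive / number of TPs."""
    tp = precisions_at_true_positives
    return sum(tp) / len(tp)

def mean_average_precision(ap_per_class):
    """mAP = sum of per-class APs / number of classes."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical per-class APs for a small three-class run:
aps = [1.0, 0.0, 0.62]
print(round(mean_average_precision(aps), 3))  # 0.54
```

Note that with this definition a class with no correct detections contributes an AP of 0 and pulls the mAP down, which is consistent with the per-class entries reported in Tables II and III.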
2017 IEEE Region 10 Symposium (TENSYMP)
Fig. 5. Cars, bus and person detected on Chennai and Bangalore Highways
TABLE II. MEAN AVERAGE PRECISION FOR KITTI_DRIVE0005 VIDEO

Class name | Average Precision (AP)
Bus        | 1
Motorbike  | 0
mAP (%): 71.7

TABLE III. MEAN AVERAGE PRECISION FOR CHENNAI ROAD VIDEO

Class name | Average Precision (AP)
Bus        | 0.62
Motorbike  | 1
mAP (%): 90.5
Fig. 6. Animals on road are detected along with pedestrians and cars