
Report for Computer Vision (IN 4393) Project

Sneha Saha (4600916)

Car Detection and Tracking

In this project, the rear views of cars are detected in a video, and a bounding box is used to track the
motion of each car. The rear view of a car has distinctive patterns, such as the dark shadow region right
below the car and the dark tire regions. With limited training data, we adopt semi-supervised multi-task
learning of deep convolutional representations, where representations are learned on a set of training
examples and then tested on several video sequences to generate the output. The project is executed as
a pipeline with the following steps-
- Performed a Histogram of Oriented Gradients (HOG) feature extraction on a labeled training set of images
- Normalized these features and randomized a selection for training and testing a classifier
- Trained a Linear SVM classifier and used it for prediction
- Implemented a sliding-window algorithm and used the trained classifier to search for vehicles in images
- Ran the neural network pipeline on a video stream
- Created a heatmap of recurring detections frame by frame to reject outliers and follow detected vehicles
- Estimated a bounding box for each vehicle detected


In this project I have used computer vision techniques and a neural network model for car detection and
tracking. The program is executed in Python 2.7 together with several other libraries.
The techniques are explained as follows-

2.a Feature extraction for car identification-

Vehicles, as a class of objects, vary in color. In contrast, structural cues such as shape give a more
robust representation. Gradients in specific directions capture some notion of shape. To allow for some
variability in shape, the Histogram of Oriented Gradients (HOG) feature is used. The idea of HOG is that,
instead of using the individual gradient direction of every pixel of an image, we group the pixels into
small cells of n x n pixels. For each cell, we compute the gradient directions of all its pixels and group
them into a number of orientation bins, summing the gradient magnitudes that fall into each bin. Stronger
gradients therefore contribute more weight to their bins, and the effect of small random orientations due
to noise is reduced. This histogram gives us a picture of the dominant orientation of that cell. Doing this
for all cells gives us a representation of the structure of the image. The HOG features keep the
representation of an object distinct but also allow for some variation in shape.
Fig 1: Parameter settings for HOG feature
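
As a rough illustration of the per-cell histogram described above, the sketch below computes a
magnitude-weighted histogram of unsigned gradient orientations for a single cell using only NumPy. The
function name cell_histogram and the use of np.gradient are assumptions made for illustration; the project
itself relies on the hog function of skimage.feature, which additionally normalizes the histograms over
blocks of cells.

```python
import numpy as np

def cell_histogram(cell, n_bins=12):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    cell = np.asarray(cell, dtype=float)
    gy, gx = np.gradient(cell)                        # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0    # fold directions into [0, 180)
    bin_width = 180.0 / n_bins
    bins = np.minimum((angle // bin_width).astype(int), n_bins - 1)
    # Each pixel votes for its orientation bin with a weight equal to its gradient magnitude.
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

# An 8 x 8 cell whose intensity rises from left to right: every gradient points along x.
ramp = np.tile(np.arange(8.0), (8, 1))
hist = cell_histogram(ramp)
print(hist.argmax(), hist.sum())
```

For this ramp, all of the gradient energy lands in the first bin, which is the "dominant orientation" the
text refers to.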

The number of orientations, pixels_per_cell, and cells_per_block are used to calculate the HOG features
of a single channel of an image. The number of orientations is the number of orientation bins into which
the gradients of the pixels of each cell are split in the histogram. pixels_per_cell is the number of
pixels in each row and column of the cell over which each gradient histogram is computed.
cells_per_block specifies the local area over which the histogram counts in a given cell are
normalized. Having this parameter is said to generally lead to a more robust feature set. We can also use
the normalization scheme called transform_sqrt, which is said to help reduce the effects of shadows and
illumination variations.

I chose to compute HOG on all channels of the image in HSL format. The Hue value represents the perceived
color as a number based on combinations of red, green and blue, the Saturation value measures how
colorful or how dull the color is, and the Lightness value measures how close to white it is. My intuition
is that the shape of the changes in all these measurements provides a good representation of the shape of
a vehicle. I chose an 8 pixel by 8 pixel cell and a 2 cell by 2 cell block, as inspired by the examples from
the lectures. These also seem like reasonable parameters, as our training set is composed of 64 pixel by 64
pixel images. I chose 12 orientation bins, as it is typical to choose a number between 6 and 12 for this
parameter. I also chose to enable the transform_sqrt normalization scheme to help reduce the effect of
shadows and illumination variations.
Fig 2: Vehicle and non vehicle visualization of HOG features in normalized form
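
To make the effect of these parameter choices concrete, the following sketch estimates the length of the
resulting HOG vector. It assumes square images and blocks that advance one cell at a time, which is how
skimage.feature.hog lays out its blocks; the helper name hog_feature_length is hypothetical.

```python
def hog_feature_length(image_size, pix_per_cell, cell_per_block, orient, channels=1):
    """Length of a HOG vector when blocks slide one cell at a time over a square image."""
    cells_per_side = image_size // pix_per_cell
    blocks_per_side = cells_per_side - cell_per_block + 1
    per_channel = blocks_per_side ** 2 * cell_per_block ** 2 * orient
    return per_channel * channels

# 64 x 64 image, 8 x 8 pixel cells, 2 x 2 cell blocks, 12 orientation bins, 3 channels
print(hog_feature_length(64, 8, 2, 12, channels=3))  # 7 * 7 * 2 * 2 * 12 * 3 = 7056
```

With these settings each channel contributes 2352 values, so doubling the number of orientation bins or
halving the cell size quickly inflates the feature vector and the extraction time.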

All the features are normalized. As the pictures above show, the raw feature vector displays only three
visible bars; the values of the other bars are too weak to show up. The StandardScaler().fit(X) and
transform() functions rescale all feature samples from the car and non-car datasets.
The functions used for HOG feature extraction are as follows-
I. param() - in this function all the parameters, such as colorspace, orient and pix_per_cell, are set
up. The parameters can be adjusted according to the outcome.
II. get_hog_feature() - this function takes the parameters from param() and uses the hog function of
the skimage.feature library to extract the HOG features from the training dataset.
III. extract_hog_features() - this function iterates over all the training images and returns the list of
feature vectors.
IV. extract_features() - this function applies the color space conversion if the color space is other than
YCrCb. Images in any other color space are converted to YCrCb, and the result is added to the feature
vector list.

2.b Applying classifier in an image frame -

I have used a total of 8,792 samples of vehicle images and 8,968 samples of non-vehicle images in my data
set. This data set is preselected by Udacity and comes from the GTI vehicle image database and the KITTI
vision benchmark suite. These images are scaled down to 64 pixels by 64 pixels each, as mentioned before.
After loading all images in memory, I used extract_feature() to extract the features of all images in the
data set. I used the StandardScaler from sklearn to make a scaler based on the mean and variance of all
the features in the data set. The StandardScaler normalizes features by removing the mean and scaling to
unit variance. We use this scaler to transform the raw features from extract_feature() before feeding the
scaled features to our classifier for training or predicting. We do this as a safety measure because it is
a common requirement for machine learning estimators, which might behave badly if the individual features
do not look like standard normally distributed data.
I used 80% of the data set to train the classifier. The remaining 20% is used to determine the accuracy of
the classifier. To randomize the splitting of the data, we used the built-in train_test_split function from
sklearn and fed it a random number between one and one hundred as the seed.
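
The scaling and splitting steps can be sketched with plain NumPy as follows. This is only an illustration
of what StandardScaler and train_test_split do conceptually, not the project code; the helper names
standardize and split_80_20 and the random stand-in data are assumptions.

```python
import numpy as np

def standardize(features):
    """Remove the per-feature mean and scale each feature to unit variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0                      # leave constant features unchanged
    return (features - mean) / std

def split_80_20(features, labels, seed):
    """Shuffle the samples and hold out 20% of them for testing."""
    order = np.random.RandomState(seed).permutation(len(features))
    n_test = int(round(0.2 * len(features)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]

# Hypothetical stand-in data: 10 samples with 3 features on very different scales.
rng = np.random.RandomState(0)
X = rng.rand(10, 3) * np.array([1.0, 100.0, 10000.0])
y = np.array([1, 0] * 5)

X_scaled = standardize(X)
X_train, X_test, y_train, y_test = split_80_20(X_scaled, y, seed=42)
print(X_train.shape, X_test.shape)
```

After scaling, every feature has zero mean and unit variance, so features with large raw values (such as
the third column here) no longer dominate the classifier.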

A linear Support Vector Machine is used as the classification algorithm. Its advantages include being
effective in high-dimensional spaces, even when the number of dimensions is almost as large as or even
larger than the number of samples, which matters here because each image yields a long feature vector.

2.c Sliding window -

I had to implement a method of searching for vehicles in an image. We can take a subregion of an image
and run the classifier on that region to see if the patch contains a vehicle. However, computing HOG
features is extremely time consuming, and the patch regions overlap heavily with each other. So instead of
computing HOG features for each patch separately, we extract the HOG features of the whole frame once
and then subsample that extraction for each sub-window of the image.

The region of interest for vehicle detection starts at approximately the 400th pixel from the top and
spans about 260 pixels vertically. Thus, the region of interest has dimensions of 260x1280 (height x
width), starting at the 400th pixel vertically. The parameters are set in the random_scan_boxes() and
slide_window() functions. The region of interest size is adjusted for the test videos.
Fig 3: Region of interest w.r.t. Original image for bounding box formation.
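
Cropping the region of interest amounts to a single array slice. The sketch below assumes a 720 x 1280
frame (the text only states the width of 1280) and uses a hypothetical helper name.

```python
import numpy as np

Y_START = 400      # first row of the region of interest (from the text)
ROI_HEIGHT = 260   # vertical extent of the region of interest (from the text)

def crop_region_of_interest(frame, y_start=Y_START, height=ROI_HEIGHT):
    """Return the horizontal band of the frame in which vehicles are searched."""
    return frame[y_start:y_start + height, :]

# Assumed 720 x 1280 RGB frame filled with zeros as a placeholder.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
roi = crop_region_of_interest(frame)
print(roi.shape)  # (260, 1280, 3)
```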

The sliding window needs a frame, a window size (ws) in pixels, and the starting vertical position (y),
i.e. the y coordinate in pixels at which we want to search for a vehicle. It outputs a list of locations
where vehicles are found. Each location is represented by the top corner of the subregion and
the length of its sides in pixels. We initialize a slider instance with extract_feature(), the classifier
and the number of pixels (increment) by which the sliding window advances horizontally along the x axis.
The functions for drawing the sliding window are as follows-

I. draw_boxes - this function takes the image as input and draws boxes around the scan region.
The color and thickness of the rectangles can be adjusted in this function.
II. slide_window - this function sets the x and y coordinates of the rectangles used for
tracking. The initial x and y coordinates are set first, and then the span in both coordinates in
which the object needs to be searched is determined. Once the span is determined, the window span
is calculated from the number of pixels in the span. The windows are then looped over the
classifier output and drawn in sequence.
III. random_scan_boxes() - this function creates a list of scan window coordinates. The window size,
the overlap region of the sliding windows and the initial positions can be adjusted in this function.
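
A minimal sketch of how such a list of scan windows could be generated is shown below. The function name
sliding_windows, the 64-pixel window and the 50% overlap are assumptions chosen for illustration; they are
not taken from slide_window() or random_scan_boxes().

```python
def sliding_windows(x_span, y_span, window=64, overlap=0.5):
    """Return ((x1, y1), (x2, y2)) corners of square windows covering the given span."""
    step = max(1, int(window * (1 - overlap)))   # pixels to advance between windows
    x_start, x_stop = x_span
    y_start, y_stop = y_span
    n_x = (x_stop - x_start - window) // step + 1
    n_y = (y_stop - y_start - window) // step + 1
    windows = []
    for row in range(n_y):
        for col in range(n_x):
            x1 = x_start + col * step
            y1 = y_start + row * step
            windows.append(((x1, y1), (x1 + window, y1 + window)))
    return windows

# Scan the region of interest described earlier: rows 400 to 660 of a 1280-pixel-wide frame.
boxes = sliding_windows(x_span=(0, 1280), y_span=(400, 660))
print(len(boxes), boxes[0], boxes[-1])
```

Each returned pair of corners can be passed to the classifier as a patch and, for positive detections,
drawn on the frame.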

2.d Convolutional neural network pipeline :

A neural network model is created for detecting vehicles and locating them in the image. The
vehicle scanning neural network pipeline consists of the following steps:

I. Obtain the region of interest from the previous steps.

II. The detection map is produced using the trained CNN model, and a confidence threshold is applied.
The predictions are very polarized, that is, they mostly stick to ones and zeros for vehicle and
non-vehicle points. Therefore, even the midpoint of 0.5 might be a reliable choice for the
confidence threshold.

Fig 4: Heat map of the region of interest

III. The obtained detection areas are labelled with the label() function of the
scipy.ndimage.measurements package. This step outlines the boundaries of the labels
detected when building the heat map.

IV. The labelled features of the detected points are projected to the coordinate space of the original
image, transforming each point into a 64x64 square and keeping those squares within the
bounds of the features area.

V. The heat map is created from the overlapping squares obtained in the step above. A heat map combines
overlapping detections and removes false positives. To make a heat map, we start with a blank
grid and add heat (+1) to all pixels within windows where the classifier reports a positive
detection. The whiter an area is, the more likely it is a true positive, and we can impose a
threshold to reject areas affected by false positives. We have integrated the heat map over
several frames of video. Areas with multiple detections get hot, while transient false positives
stay cool. We have made a HeatMap function to implement this. We initialize it with the size of
the heat map (by feeding it a sample frame), the threshold, and its memory size, i.e. how many
frames it keeps before rejecting the oldest frame. I used scipy.ndimage.measurements.label()
to identify individual blobs in the heat map; each blob corresponds to a vehicle. I constructed
bounding boxes to cover the area of each blob detected.

Fig 5: Heat map with overlapping squares of the detected region
VI. The heat map is labelled again, producing the final islands for the actual vehicle bounding boxes.

VII. The labelled features of the heat map are saved to the list of labels, where a certain number of
consecutive frames are kept in series.

VIII. The final step is getting the actual bounding boxes for the vehicles. The OpenCV function
cv2.groupRectangles() clusters all the input rectangles using a rectangle equivalence criterion that
combines rectangles with similar sizes and similar locations. The function has a groupThreshold
parameter, described as the "Minimum possible number of rectangles minus 1". That is, no
bounding box is produced until the history accumulates bounding boxes from at
least that number of frames.
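
The heat map with a limited frame memory described in step V can be sketched as follows. The class below is
an assumption about how such a HeatMap could be structured (constructor arguments, method names and the
example boxes are all hypothetical), not the project code.

```python
from collections import deque
import numpy as np

class HeatMap(object):
    """Accumulates detections over the last `memory` frames and thresholds the result."""

    def __init__(self, frame_shape, threshold, memory):
        self.shape = frame_shape[:2]
        self.threshold = threshold
        self.history = deque(maxlen=memory)   # the oldest frame drops out automatically

    def update(self, boxes):
        """Add one frame of detections given as ((x1, y1), (x2, y2)) boxes."""
        heat = np.zeros(self.shape, dtype=np.int32)
        for (x1, y1), (x2, y2) in boxes:
            heat[y1:y2, x1:x2] += 1           # +1 for every pixel inside a positive window
        self.history.append(heat)

    def mask(self):
        """Pixels whose heat, summed over the remembered frames, exceeds the threshold."""
        total = np.zeros(self.shape, dtype=np.int32)
        for heat in self.history:
            total += heat
        return total > self.threshold

heatmap = HeatMap(frame_shape=(720, 1280, 3), threshold=2, memory=5)
for _ in range(3):
    heatmap.update([((800, 400), (864, 464))])    # detection that recurs in three frames
heatmap.update([((100, 500), (164, 564))])        # transient false positive in one frame
hot = heatmap.mask()
print(hot[430, 830], hot[530, 130])
```

The recurring detection exceeds the threshold and stays in the mask, while the single-frame detection does
not, which is the outlier rejection the text describes.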


There are many library functions in Python that help with inline plotting, extracting features from
grayscale images, and classification. The libraries I used are as follows-

Numpy - Numpy is a library for the Python programming language, adding support for large,
multidimensional arrays and matrices, along with a large collection of high-level mathematical
functions to operate on these arrays.

Matplotlib - Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy.

Scikit-learn - Scikit-learn is a free software machine learning library for the Python programming
language. It features various classification, regression and clustering algorithms including
support vector machines, gradient boosting and k-means.

OpenCV - OpenCV is a library of programming functions mainly aimed at real-time computer
vision.

Glob - Glob is a library for specifying sets of file names with wildcard characters.

Keras - Keras is an open source neural network platform in Python for fast experimentation with
deep neural networks.

Scipy - Scipy is an open source Python library for scientific computing, including optimization
and linear algebra.

Moviepy - Moviepy is a Python library for script-based video editing (cuts, concatenations, text
insertion, non-linear editing).

IPython - IPython is a command shell for interactive computing in multiple programming languages,
originally developed for the Python programming language.

The project is implemented using Python 2.7, and the following steps need to be carried out to execute it-

1. Download the python file and the dataset folder.

2. To run this project, libraries such as numpy, skimage, OpenCV (cv2), scipy, glob, keras and sklearn
need to be installed.
3. Make a folder named Frames in the location where the python files and dataset are kept.
4. Put the input video in the same location.
5. Open the python code for vehicle detection and tracking and change the
name of the input video.
6. Run the python code and wait for it to finish.

The folder test_videos contains all the test video files, and the folder test_video_output contains all
the output videos of tracking with this model. The Vehicle and Non vehicle folders contain all the
training images of cars and non-cars.


The labeled data for vehicle and non-vehicle examples used to train the classifier come from a combination
of the GTI vehicle image database, the KITTI vision benchmark suite, and examples extracted from the
training video itself. A neural network model is used in addition, as it adapts to different camera
perspectives, lighting conditions and reflections on the vehicles. Keras can separate a portion of the
training data into a validation dataset and evaluate the performance of the neural network model on that
validation dataset at each epoch.

The locate() function takes an input image of any size together with region windows, finds the vehicles
within the region, and returns the marked heat map and the vehicle bounding box coordinates. This function
is used in NN_pipeline(), which applies the neural network for vehicle detection and heat map creation.
The final bounding boxes for the vehicles are detected by the label function from the
scipy.ndimage.measurements package and drawn using the cv2.rectangle() function of the OpenCV library.
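
The step from a thresholded heat map to bounding boxes can be illustrated as follows. To keep the sketch
free of external dependencies, label_blobs is a simple flood-fill stand-in for the label function of
scipy.ndimage.measurements (using the same 4-connectivity as that function's default); both helper names
and the toy mask are assumptions, not the project code.

```python
import numpy as np

def label_blobs(mask):
    """4-connected component labelling via flood fill (stand-in for scipy's label())."""
    labels = np.zeros(mask.shape, dtype=np.int32)
    count = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue                          # pixel already belongs to a labelled blob
        count += 1
        labels[seed] = count
        stack = [seed]
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                inside = 0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                if inside and mask[ny, nx] and not labels[ny, nx]:
                    labels[ny, nx] = count
                    stack.append((ny, nx))
    return labels, count

def blob_boxes(labels, count):
    """One ((x1, y1), (x2, y2)) bounding box per labelled blob."""
    boxes = []
    for blob in range(1, count + 1):
        ys, xs = np.nonzero(labels == blob)
        boxes.append(((int(xs.min()), int(ys.min())), (int(xs.max()), int(ys.max()))))
    return boxes

mask = np.zeros((20, 30), dtype=bool)
mask[2:6, 3:9] = True        # first hypothetical vehicle blob
mask[10:15, 20:28] = True    # second hypothetical vehicle blob
labels, count = label_blobs(mask)
print(count, blob_boxes(labels, count))
```

Each pair of corners has the form that cv2.rectangle() expects for its two corner points, so the boxes can
be drawn directly on the output frame.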

5.a Results :

The functions get_hog_feature(), extract_hog_feature() and extract_features() together extract the HOG
features, spatial features and color histogram features. For the HOG feature calculation, 8 orientations,
7 pixels per cell and 2 cells per block are used with the YCrCb colorspace. When used for training, this
combination gives a test accuracy of 99.40%.
The output of HOG feature extraction and SVC training on the vehicle image dataset is as follows-
130.45 Seconds to extract HOG features...
Using: 8 orientations 7 pixels per cell and 2 cells per block
Feature vector length: 9312
6.34 Seconds to train SVC...
Test Accuracy of SVC = 0.994
My SVC predicts: [ 0. 1. 0. 1. 0. 0. 0. 1. 0. 0.]
For these 10 labels: [ 0. 1. 0. 1. 0. 0. 0. 1. 0. 0.]
0.00243 Seconds to predict 10 labels with SVC
Training accuracy: 1.0
Testing accuracy: 0.99402900199
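
The reported feature vector length of 9312 can be reproduced with a short calculation. The HOG part follows
from the logged parameters; the 32 x 32 spatial size and the 32 histogram bins per channel are assumptions
that happen to be consistent with the total, since the report does not state them.

```python
orient, pix_per_cell, cell_per_block = 8, 7, 2   # values reported in the log above
image_size, channels = 64, 3                     # 64 x 64 training images, three channels

# HOG: blocks advance one cell at a time over the grid of whole cells.
blocks_per_side = image_size // pix_per_cell - cell_per_block + 1
hog_length = channels * blocks_per_side ** 2 * cell_per_block ** 2 * orient

spatial_size = 32   # assumption: image resized to 32 x 32 for the spatial feature
hist_bins = 32      # assumption: 32 bins per channel for the color histogram
spatial_length = spatial_size * spatial_size * channels
hist_length = hist_bins * channels

total = hog_length + spatial_length + hist_length
print(hog_length, spatial_length, hist_length, total)  # 6144 3072 96 9312
```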

In the neural network pipeline, Keras can separate a portion of the training data into a validation
dataset and evaluate the performance of the model on that validation dataset at each epoch. The epoch
parameter, set at line number 757 of the code, can be adjusted to improve the test accuracy.

The neural network model is trained on the vehicle dataset and on non-vehicle images from the frames of
project_video.mp4. The result of the neural network is -
Test score: 0.020839428846
Test accuracy: 0.989301801802

5.b Experiments-

The project involved collecting training data, allowing the car CNN to learn from the training data,
using a simulation to test the car, and finally driving it on other test videos. The network was trained on
sequences of driving footage from project_video.mp4 and tampa_traffic.mp4. The training data included
footage from the front-facing cameras of the car. As the data was fed into the CNN, it learned to output
the appropriate steering command for different situations. For the experiments, I collected some
video sequences in the folder test_videos. The videos are run against this neural network
model by adjusting the parameters for the region of interest and the epoch
parameter of the neural network model. The outcomes of some of the tracking experiments are
as follows-

1. video 1 - the video sequence contains a tunnel in which the tracking of the car is lost. The car is
tracked before entering the tunnel and after leaving it. The bounding box disappears once the car
enters the tunnel, as the model is unable to classify the car in the frames
inside the tunnel.
Fig 6: sequence from video 1 showing tracking before and after entering the tunnel

2. video 2, video 3 and video 4 - the video sequences are pretty similar to the training sequence, so the
cars are tracked across all the frames. The rear views of cars are detected and tracked, but the cars in
the opposite lanes are not detected by the model. In video 3, the rear views of the cars are tracked to
some extent despite the low lighting conditions, but in video 4 the tracking was poor because the car was
far away and the image had low contrast.

Fig 7 :sequence from video3 and video 4

3. video 5 - the video was captured with a circular camera motion at a corner of the scene. In the output,
the sliding window detects the rear views of the cars, but the tracking is lost. The movement of the
camera was too fast to track the cars, even after adjusting the region of interest parameters.

Fig 8: shows the sequence from video5 output

The car detection and tracking worked relatively well when there were few cars on the road, but when the
streets were crowded with many cars, the bounding box failed to adjust its width to the different
cars accordingly. Another limitation is that the model often fails to identify cars that are far away.
Also, due to the limited variation in the vehicle dataset, a number of cars were missed while tracking.

A neural network is fast, but on a CPU-only setup the speed gain was limited. The good thing about the
neural network is that it is easy to set up and train. Once enough vehicle and non-vehicle images
and labels are available, the network can be trained. The network will figure out a vehicle, part of a
vehicle and the car location in the image. Interestingly, it will also pick up tail lights, signs, traffic
lights, tree tops and the skyline as vehicles. The overhead objects are easy to remove, but head/tail
lights are hard to deal with, as they may belong to one car or two cars. If the vehicle is at close range,
the head/tail lights show up as two blobs, but I do not want to mark them as two cars. For future work, it
is also possible to collect different object groups, such as people, lane markers, trees, posts, signs,
traffic lights and buildings, and train the network on all of them together.

The training capability of the neural network detection model permits the system to adapt to variations
in lighting and camera placement. This technique does not rely on the development or calibration of
rigid templates. Instead, it learns to recognize vehicle shapes by watching several example vehicles. This
provides a significant benefit in comparison to classical image processing, since the neural network can
adapt to different camera perspectives, lighting conditions, and so on.