
Automotive Traffic Tracking through Pattern Recognition

John Martin

Introduction
As automotive traffic patterns in many large cities grow more complex, the study of these
traffic flows requires new tools which can bridge the gap between treating traffic flows as
a statistical fluid and examining individual driver behavior. Presented here is a tool
which can track individual vehicles as they pass an overhead visible light camera. The
result of such tracking will be a set of trajectories from which human driver behaviour
and vehicle interaction may be studied.

This paper will describe the image pattern recognition techniques used to observe
vehicles as well as the tracking techniques used to tie each frame’s pattern recognition
results together. In addition, we will discuss reasons for moving away from the motion
segmentation techniques commonly used in the past as well as the current limitations of
applying these techniques.

Existing Systems
Previous research in automotive tracking systems has not been completely successful. In
[6] Kalman-Snakes provide automobile contour tracking after an initial motion
segmentation step. The authors of [2] use block matching to find optical flow, and a
priori knowledge of the road geometry is used to handle stationary vehicles. In [14]
background estimation isolates foreground objects as “blobs”, and principal component
analysis is then used to classify the blobs and estimate their orientation.

Motion Segmentation Systems


Many video tracking systems identify objects by virtue of their motion. In cases where
vehicles are moving quickly past the sensing camera, these motion segmentation
techniques are fast and robust. Unfortunately, in cases where the sensing camera observes
a largely stationary traffic light queue, motion estimation based systems begin to have
problems. In these cases, motion segmentation often cannot be used because there is very
little motion to be observed. Additionally, long shadows cast by vehicles can cause
regions of motion to bleed together into a single large motion segment. Also, inevitable
camera vibration can render many motion segmentation algorithms useless, since camera
vibration moves the entire image and segmenting the motion of individual cars becomes
more difficult. In these problem situations, it becomes necessary to identify vehicles by
their appearance rather than their motion.

Finding Cars by Appearance
Since we have dismissed motion segmentation as unreliable for this application, the
traffic tracker must be able to identify a certain pattern of pixels as being the image of a
vehicle. Ideally, one could set up a lookup table which would contain every pixel pattern
possibility coupled with a binary value indicating the pattern’s class, either “vehicle” or
“no vehicle”. Classification would be a simple matter of retrieving the binary value
residing at the current pattern’s location in the lookup table. Unfortunately, aside from the
difficulties of training such an ideal classifier, no computer, present or future, has enough memory for a lookup table with $256^{6400}$ entries (an 80x80 pixel image fragment with 256 discrete levels) [9].

Because this “ideal” pattern classifier is not possible to implement, a more realistic
pattern classifier might search for simple features which are specific to images of
vehicles. Unfortunately, we have not been able to find simple, specific vehicle features
which are invariant between vehicle types, reflectivity, orientation and lighting.

Our initial attempts at solving this vehicle recognition problem used geometric primitive
templates. This earlier algorithm discovered edges of cars in the input image with the
Canny edge detector [22], and used those edges to fit ellipses [21]. If the ellipses were
the correct length and width for a vehicle, a vehicle was assumed to exist at the center of
the ellipse. Unfortunately, the resulting algorithm proved unwieldy and highly dependent
on the setting of thresholds. With proper tweaking, the algorithm could work for short
periods of time on example video, but, generally, the algorithm would fail when
presented with geometries and lighting different from the situation for which the various
thresholds were set. As an example, closely packed vehicles at traffic light queues made
it particularly difficult to assign distinct, correct ellipses. Also, a shadow could eliminate
the distinct edges on one side of the vehicle and thereby change the geometry of the
vehicle significantly. The ellipse fitting experiments revealed that using simple templates
is a problem because there is no systematic method of extending simple templates to new
geometries and lighting conditions.

In contrast to the manual process of selecting a template, some pattern recognition techniques attempt to build classifiers via automatic training [9][15][16][17]. These
techniques use example images, as shown in Figure 2, to train a classifier which decides
whether a newly sampled input pattern contains a target pattern. The use of these
automatic methods of training a vehicle classifier may provide us with a means of
updating the traffic tracker’s functionality when previously unknown geometries and
lighting conditions occur.

Scanning the Input Image

A pattern classifier can be used as a scanner which moves across an input image,
classifying an 80x80 sub image at each image location. Figure 1 attempts to illustrate the
scanner. Each of the sub images selected by the scanner is then fed into a classifier
which decides whether there is a car centered in the sub image. Figure 2 shows several
80x80 sub images which do have cars at their center.

Figure 1 The scanner selects all the 80x80 square sub images from the original image and tries to
discover if there are cars in the middle of the sub image

Figure 2 Positive training images, car centered in an 80 by 80 pixel sub image
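To make the scanning step concrete, a rough sketch follows. It assumes a grayscale NumPy image and a hypothetical classify_80x80() callable standing in for the classifier described in the following sections; the stride is an illustrative value, not a tuned one.

import numpy as np

def scan_image(image, classify_80x80, window=80, stride=4):
    """Slide an 80x80 window across a grayscale image and collect the centers
    of the sub images that the classifier labels as containing a car."""
    detections = []
    rows, cols = image.shape
    for top in range(0, rows - window + 1, stride):
        for left in range(0, cols - window + 1, stride):
            sub = image[top:top + window, left:left + window]
            if classify_80x80(sub):                     # 1 = car centered, 0 = no car
                detections.append((top + window // 2, left + window // 2))
    return detections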

Support Vector Machines
In the first attempt to apply pattern recognition techniques to this problem, we used a
support vector machine (SVM) [17] to indicate whether a certain set of pixels could be
classified as a vehicle. This SVM classifier was then applied to each 80x80 sub image
within the original image to find car image locations.

This method showed promise -- one could even say that it "worked". The primary drawback was speed: a single computer could require more than a hundred seconds to scan one frame of size 768 x 200 for vehicles. The use of multiple networked computers brought the frame evaluation time down below ten seconds, but the complexity of messaging across a network made this solution an undesirable one.

The SVM classifier was trained with approximately 300 features selected from a sub
image’s Haar wavelet coefficients. These features were selected manually, with no
algorithmic guidance other than generally picking low frequency coefficients in the
vicinity of the center of each sub image under consideration. In short, these 300 features
were selected by best guess. Unfortunately, leaving feature selection to “best guess” is
not a particularly good way of designing a classifier. Also, selecting features from a
complete basis set such as the Haar basis is limiting. An overcomplete basis set provides
a richer set of features to choose from.

Viola’s Face Detector


In [23] Viola uses a classifier which is trained with an integrated feature selection method
and produces good results with face detection. Viola uses sets of weak classifiers [23].
Each of these weak classifiers is associated with a simple image feature which can
correctly distinguish a vehicle pattern from a non-vehicle pattern perhaps 60% of the
time. Taken together, several hundred of these weak classifiers can provide classification
performance much greater than the performance of the individual parts.

Viola also discusses how to “cascade” simple classifiers such that certain regions of the
input image are eliminated from consideration early in the classification process. Because
no further computational resources are devoted to these eliminated regions, this early
elimination can produce significant increases in classification speed.

Collecting Viola’s Weak Classifiers with Adaboost


A collection of Viola’s classifiers is not necessarily capable of combining as a single
strong classifier. If each classifier in a collection of weak classifiers makes the same
types of errors as its mates, there is no method of combining the classifiers’ output such
that a combined classifier has better performance. Viola uses the Adaboost algorithm [24], which attempts to ensure that each weak classifier in a collection of classifiers makes varied types of mistakes. Adaboost does this by varying weights on the training set such that training set patterns which are already "taken care of" by previously trained classifiers do not heavily impact the training of the current weak classifier.

Viola’s Feature Evaluation—Integral Image Features


One of the primary requirements for selecting the types of features used by these weak
classifiers is that the feature’s evaluation must be fast. For example, features which
require several hundred element dot products for evaluation are too slow. Viola uses an
integral transform to quickly find the sum of the pixel values in any rectangular region in
the image. An integral transform consists of:

x, y
II ( x, y ) = ∑ I ( x, y )
0,0

where I ( x, y ) is the pixel value of image I at coordinate x, y . After an initial


computation of the integral transform, II ( x, y ) , the sum of pixels within any rectangle in
image I which has vertex points A,B,C and D in the arrangement shown in Figure 3 can
be found by simply evaluating

Area = II ( B) − II (C ) − II ( A) + II ( D)

Figure 3

This speedy method of discovering the sum of pixels within a rectangle can be used to evaluate a set of simple operators, drawn from an overcomplete family, which can be used as weak classifiers.
These operators may be thought of as match detectors. The filter kernel for several types
of these operators may be found in Figure 4, where gray regions are zero, white regions
are 1 and black regions are -1. After the integral transform has been used to find the sum
of pixels in both the white and black regions, the operator value is found by subtracting
the black region sum from the white region sum.
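A small sketch of the integral transform and its use, in plain NumPy, may help. The rect_sum() helper performs the four-lookup rectangle sum; how its (top, left, height, width) arguments correspond to the corner labels A, B, C and D depends on the arrangement fixed in Figure 3, so the correspondence here is only illustrative.

import numpy as np

def integral_image(img):
    """II(x, y): sum of all pixels above and to the left of (x, y), inclusive.
    A leading row and column of zeros removes the boundary special cases."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle from four lookups in the integral image."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])

def two_rect_feature(ii, top, left, height, width):
    """A horizontal two-rectangle operator: white (left half) minus black (right half)."""
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return white - black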

Using the Features

Figure 4 Examples of simple operators to be evaluated with an integral transform


Figure 4 shows these operator examples placed in the context of the 80 by 80 pixel sub
region which is used for classification and training. The location, size and orientation of
the above operator types may be changed such that the numeric result of evaluating the
operator can be used to distinguish between an 80 by 80 pixel training image which has a
car centered within it and a training image which does not have a car centered in it.

For example, if one desired a simple classifier which identifies BMW logos (Figure 5)
within an input image, the first feature type shown in Figure 4 could be used.
Conveniently, this feature shares the “center of gravity symbol” appearance with the
BMW logo. The feature could be centered within the 80x80 input region, and the size
increased such that it looked like the feature operator shown in Figure 6.

Figure 5 An object which is easy to find


Unfortunately, identifying BMW logos is a simple case in which a single feature can
provide reasonable classification performance. In the case of identifying cars, an
individual operator using a single simple feature cannot distinguish between all car and
non-car patterns. An individual operator might be hard-pressed to classify even 60% of
the car, non-car images correctly.

Figure 6 This operator detects BMW logos

Viola’s Classifiers

Viola’s adaptation of Adaboost provides a method of weighing and selecting these feature
operators such that a collection of the operators may classify a complex object such as a
car with a reasonable error rate.

Viola defines a simple classifier based on features evaluated with the integral transform [23]. This weak classifier consists of:

$$h(x) = \begin{cases} 1 & \text{if } p f(x) < p \theta \\ 0 & \text{otherwise} \end{cases}$$

where h is the classifier, p is the parity, $\theta$ is the threshold, f is one of the integral transform features discussed above, and x is the 80 x 80 sub image. The classifier reports a value of 1 when it believes that a car has been found, and it reports a value of 0 when a car has not been found. Training a classifier which uses a given feature f consists of discovering the threshold and parity which maximize its classification performance on the training set.
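Training one such weak classifier amounts to a one-dimensional search over thresholds and the two parities. A minimal sketch, assuming the feature f has already been evaluated on every 80x80 training sub image:

import numpy as np

def train_weak_classifier(feature_values, labels, weights):
    """Find the threshold and parity minimizing the weighted error of
    h(x) = 1 if parity * f(x) < parity * threshold, else 0.
    labels are 0/1, weights sum to one. Returns (error, threshold, parity)."""
    best = (np.inf, 0.0, 1)
    for threshold in np.unique(feature_values):
        for parity in (1, -1):
            predictions = (parity * feature_values < parity * threshold).astype(int)
            error = np.sum(weights * np.abs(predictions - labels))
            if error < best[0]:
                best = (error, threshold, parity)
    return best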

Viola’s Use of Adaboost


Viola’s method of training a set of weak classifiers $h_t$ uses the Adaboost method of varying the weights associated with the training set in the following way [23]:

• Given example images $(x_1, y_1), \ldots, (x_n, y_n)$ where $y_i = 0, 1$ for negative and positive examples respectively.
• Initialize weights $w_{1,i} = \frac{1}{2m}$ for $y_i = 0$ and $w_{1,i} = \frac{1}{2l}$ for $y_i = 1$, where m and l are the number of negatives and positives respectively.
• For $t = 1, \ldots, T$:
   1. Normalize the weights: $w_{t,i} \leftarrow w_{t,i} / \sum_{j=1}^{n} w_{t,j}$
   2. For each possible feature $f_j$, train a classifier $h_j$. Find the error of the classifier by evaluating $\varepsilon_j = \sum_i w_i \, | h_j(x_i) - y_i |$.
   3. Choose the classifier $h_t$ with the lowest error $\varepsilon_t$.
   4. Update the weights: $w_{t+1,i} = w_{t,i} \, \beta_t^{1 - e_i}$, where $e_i = 0$ if example $x_i$ is classified correctly, $e_i = 1$ otherwise, and $\beta_t = \frac{\varepsilon_t}{1 - \varepsilon_t}$.
• The final classifier is:

$$h(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases} \qquad \text{where } \alpha_t = \log \frac{1}{\beta_t}$$

Note that the weight associated with each training example is decreased as weak
classifiers correctly classify the training example. In this way, as weak classifiers are
trained, they are less responsive to training examples which are already “covered” by
previously trained weak classifiers. Also note that the final classifier sums the weak
classifiers’ results with weights based on the weighted error of the classifier within the
training set.

Unlike the pseudo-code description above, the training code used in the car detection
application does not exhaustively search for the “best” weak classifier on each iteration
of the main “for” loop. It instead evaluates several thousand randomly chosen candidate classifiers and selects the best of them. This short-cut substantially decreases training time. However, since
each iteration is not necessarily finding the best classifier, the final classifier’s
performance may suffer.
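The boosting loop itself is short. The sketch below assumes feature_values is a pre-computed (examples x features) array of operator responses and reuses train_weak_classifier() from the earlier sketch; in keeping with the short-cut just described, each round examines a random subset of candidate features rather than all of them, and the round and candidate counts are placeholders.

import numpy as np

def adaboost(feature_values, labels, rounds=200, candidates_per_round=2000, seed=0):
    """Train a boosted collection of threshold weak classifiers, following [23][24]."""
    rng = np.random.default_rng(seed)
    n_examples, n_features = feature_values.shape
    n_pos, n_neg = np.sum(labels == 1), np.sum(labels == 0)
    weights = np.where(labels == 1, 1.0 / (2 * n_pos), 1.0 / (2 * n_neg))
    strong = []                                   # (feature index, threshold, parity, alpha)
    for _ in range(rounds):
        weights = weights / weights.sum()         # step 1: normalize
        best = None
        candidates = rng.choice(n_features, size=min(candidates_per_round, n_features), replace=False)
        for j in candidates:                      # step 2: train candidate weak classifiers
            error, theta, parity = train_weak_classifier(feature_values[:, j], labels, weights)
            if best is None or error < best[0]:
                best = (error, j, theta, parity)
        error, j, theta, parity = best            # step 3: keep the lowest-error candidate
        error = min(max(error, 1e-10), 1 - 1e-10) # guard against degenerate beta values
        beta = error / (1.0 - error)
        predictions = (parity * feature_values[:, j] < parity * theta).astype(int)
        weights = weights * np.where(predictions == labels, beta, 1.0)   # step 4
        strong.append((j, theta, parity, np.log(1.0 / beta)))
    return strong

def strong_classify(strong, x_features):
    """Weighted vote of the weak classifiers against half the total alpha weight."""
    score = sum(a for j, th, p, a in strong if p * x_features[j] < p * th)
    return int(score >= 0.5 * sum(a for _, _, _, a in strong))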

Cascade of Classifiers
Viola’s feature evaluation is fast, but scanning an image with a 1000 feature classifier
remains time consuming. Thus, initial stages of classification are performed with
relatively inaccurate, but simple, collections of weak classifiers. Areas of the image
which are unlikely to correspond to a vehicle are rejected early in the process. For
instance, in the first stage of the classifier, there are only three features used, shown in
Figure 7. The resulting three feature classifier can find image regions which are not
likely to be vehicles as shown in Figure 8. The portions of the image which are not filled
in with red pixels are non-vehicle regions and these regions may be removed from further
consideration. After this initial classification step, more complex classifiers may then be
applied to areas of the image still under consideration. Figures 9,10 and 11 illustrate this
process. Each classifier stage becomes more complex and time consuming, but each
stage also eliminates sections of the image. The complex latter stage classifiers will
never see the regions of the image which are “easy” to dismiss as non-car. Depending on
the complexity of the image, large speed increases can be realized.
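The cascade evaluation is then just an early-exit loop over stages of increasing cost; the stage objects and their predict() method below are an assumed interface, not code from the tracker.

def cascade_classify(stages, sub_image):
    """Run one 80x80 window through the stages in order; reject as soon as any
    stage says 'not car', so most windows never reach the expensive later stages."""
    for stage in stages:
        if stage.predict(sub_image) == 0:
            return 0            # rejected early, no further work on this window
    return 1                    # survived every stage: keep as a car candidate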

Figure 7 The three operators used in the first classifier stage.

Figure 8 The red regions represent areas of the image which are still under consideration as “car-
like”. This is only a quick, first stage analysis of the image, which uses only the three features shown
in Figure 7.

Figure 9 The red regions represent areas of the image which are still under consideration as “car-
like”. This is the sixth stage of a 20 stage classifier. While some of the red regions are correct, there
is still considerable noise.

Figure 10 The red regions represent areas of the image which are still under consideration as “car-
like”. This is the eleventh stage of a 20 stage classifier. Much of the noise has been eliminated.

Figure 11 The red regions represent areas of the image which are still under consideration as “car-
like”. This is the final stage of a 20 stage classifier. While all the noise has been eliminated, the car
second from the front on the far side of the queue is no longer detected.

However, classifier stages with small numbers of operators cannot provide the necessary
classification accuracy. Thus, much of the classifier’s heavy-lifting is done in the
classifier’s later stages.

The positive training set remains constant for all classifier stages while the negative
training set consists of negative images which the previous stage failed to classify
correctly. The false positives of the previous stage are used to train the current stage such
that the classifier stages have varied “talents.”

Using a Support Vector Machine as a Final Stage

As discussed above, using a support vector machine classifier [16] to scan the entire input image is rather time consuming, so a support vector machine classifier may be employed as a final stage after the less time consuming classifiers have, hopefully, classified most of the image as “not car”. The final support vector machine uses three hundred of Viola’s
integral transform features to form a binary valued SVM input vector. The SVM was
tested on a labeled set of test images which were kept separate from the training images,
and its accuracy was 98%.

Support Vector Machines

Support vector machines are a relatively recent development in pattern recognition [16][17]. Although the full theory is beyond the scope of this paper, here is an attempt to provide a short introduction.

Say there are two classes of example patterns (vehicle or no vehicle, positive or negative,
for example). Each example pattern may be expressed as a vector and placed as a point
within a vector space shown below in two dimensions.

Figure 1 Classifying set A and set B
If the two classes of example patterns are separable, each class forms its own cloud of
points in the vector space and a plane may be drawn between the two clouds of points.
New example vectors are classified by evaluating the side of the separating plane on
which they lie. Implicit here is the assumption that vectors of the same class lie together
in their vector space.

In general, however, the pattern vector space is not limited to two dimensions as shown in the above figure. Say there is an example set $S = \{(X_i, y_i)\}_{i=1}^{m}$, where each $X_i$ is a pattern vector of size n, there are m such vectors, and the $y_i$ are simply labels which indicate the vector's class, $y_i \in \{-1, 1\}$. The classifier then takes the following form:

Equation 1

$$f(X) = \sum_{i=1}^{m} \lambda_i y_i X_i^T X + b$$

$f(X) = 0$ is a hyper-plane separating the two classes of $X_i$. The $\lambda_i$ and the origin offset b are selected during training such that the margin between the training set points and the hyper-plane is maximized. For many vectors $X_i$, the corresponding scalar $\lambda_i$ will be very close to zero. These $X_i$ may be neglected in the classifier's summation. The remaining $X_i$, with non-zero $\lambda_i$, are called support vectors.
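As a small sketch, Equation 1 translates almost directly into code; support_vectors, lambdas, labels and b are assumed to have come from some standard SVM training routine.

import numpy as np

def svm_decision(X, support_vectors, lambdas, labels, b):
    """Equation 1: f(X) = sum_i lambda_i * y_i * (X_i . X) + b.
    Classification is the sign of the returned value."""
    return np.sum(lambdas * labels * (support_vectors @ X)) + b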

Unfortunately, separating the two classes by means of a hyper-plane is a rather limited strategy. In many cases, the two clouds of points representing the positive and negative patterns in the vector space are so entwined that they cannot be linearly separated. However, the vectors will often be separable by a non-linear surface. This non-linear separating surface may be found by projecting the vectors from the original input space into a higher dimensional “feature” space, where a separating hyper-plane can often be
found. For instance, in the figure below, two classes (x’s and o’s) are placed in a 2D
space, and the classes are not linearly separable. However, one could imagine an ellipse
might be drawn such that the x’s are inside the ellipse and the o’s are outside.

Figure 2 Find a separating surface for the x and o classes

One way of finding the separating ellipse would be to project these 2D vectors $[z_1, z_2]$ into a higher dimensional space $[z_1, z_2, z_1^2, z_2^2, z_1 z_2]$ where this ellipse becomes a hyper-plane [16].

The major drawback to finding a hyper-plane in a higher dimensional space is that each
of the inner products in Equation 1 require a number of multiplications equal to the
dimension of this higher dimensional feature space. This problem would seem to limit
the dimensionality of the feature space. However, most projections into the feature space
are accomplished with an implicit mapping expressed as a kernel function which defines
the inner product between two vectors in the feature space:

Equation 2
K ( X , Z ) = φ ( X ) • φ (Z )

where K is the kernel function and the vector valued function φ () is the mapping from
the original input space to feature space. K is selected such that φ () is complex enough
to possibly separate linearly inseparable classes, while K itself is kept reasonably simple
and relatively computationally non-intensive.

With the introduction of a kernel function, the classifier function becomes:

Equation 3

$$f(X) = \sum_{i=1}^{m} \lambda_i y_i K(X_i, X) + b$$

Notice how the kernel function in equation 3 takes care of evaluating the inner product in
feature space. The mapping function φ () does not need to be evaluated at all. [16]

Selection of the SVM Kernel

There are two primary criteria used when selecting an SVM kernel function:
1) The kernel function must provide a rich feature space [7]
2) The kernel function must be computationally non-intensive

As an example, let us consider the following homogeneous quadratic kernel function [16]:

Equation 4

$$K(X, Z) = (X \cdot Z)^2 = \sum_{(k,j)=(1,1)}^{(n,n)} (x_k x_j)(z_k z_j) = \phi(X) \cdot \phi(Z)$$

Here, from Equation 4, we can see that this kernel makes $\phi(X) = (x_k x_j)_{(k,j)=(1,1)}^{(n,n)}$, where $x_k$ is the kth element of X. In this particular case, the mapping function $\phi$ provides an $n^2$ dimensional space in which to find a separating hyper-plane, while computing an inner product in this $n^2$ dimensional space only requires n multiplications for the inner product and one more to square the result.
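The economy promised by Equation 4 can be checked numerically. The following sketch builds the explicit n^2-dimensional mapping phi(X) = (x_k x_j) and confirms that its inner product matches the cheap kernel value (X . Z)^2; the test vectors are arbitrary.

import numpy as np

def phi(v):
    """Explicit quadratic feature map: all n*n products v_k * v_j."""
    return np.outer(v, v).ravel()

X = np.array([1.0, 2.0, 3.0])
Z = np.array([0.5, -1.0, 4.0])

explicit = np.dot(phi(X), phi(Z))   # inner product in the n^2-dimensional feature space
kernel = np.dot(X, Z) ** 2          # same value from n multiplications and one square
assert np.isclose(explicit, kernel)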

The complexity of the kernel is only one factor affecting the amount of time it takes to
classify a certain input vector. Additionally, the number of λi ’s, m , linearly affects the
amount of time it takes to evaluate the classification function. Fortunately, the set of
support vectors discovered during training is not necessarily the smallest set of vectors
which can describe the decision surface f ( X ) = 0 .

If the quadratic, homogeneous, kernel is used in the classifier in equation 3, the following
expression results for the SVM classifier:

Equation 5

$$f(X) = \sum_{i=1}^{m} \lambda_i y_i (X_i^T X)^2 + b$$

which may be expressed in matrix form as [see Appendix A]:

Equation 6

$$f(X) = X^T A X + b$$

where the elements of matrix A are:


Equation 7

$$A_{uv} = \sum_{i=1}^{m} \lambda_i y_i x_{iu} x_{iv}$$

where $x_{iu}$ denotes the uth element of the ith vector $X_i$.

Since A is symmetric, it may be decomposed as $A = \Lambda^T V \Lambda$, where the rows of $\Lambda$ are the eigenvectors of A and V is a diagonal matrix with the corresponding real eigenvalues along the diagonal. Another way of expressing this spectral decomposition is:

Equation 8

$$A_{uv} = \sum_{i=1}^{n} \alpha_i z_{iu} z_{iv}$$

where $Z_i$ is an eigenvector of A and $\alpha_i$ is the corresponding eigenvalue. We can see, then, that if the support vectors are replaced by the eigenvectors of A and the $\lambda_i y_i$ are replaced by the corresponding eigenvalues, we obtain the same classifier decision surface. This is a useful result because the number of vectors needed to define the decision surface $f(X) = 0$ has now been reduced: $n \ll m$. This result is given in [3], but there it was derived by minimizing a function; the derivation above seems a more direct path to the result.

With this exact simplification method, a quadratic, homogeneous kernel classifier may be
greatly simplified by finding the eigenvectors and eigenvalues of the symmetric matrix
A . However, this method works only on quadratic, homogeneous kernels, which may or
may not offer the mapping and the feature space complexity required.
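A sketch of this exact simplification, assuming the support vectors, the products lambda_i * y_i and the offset b are already available from training: build A, take its eigendecomposition, and use the eigenpairs as a smaller support set.

import numpy as np

def reduce_quadratic_svm(support_vectors, lambda_y):
    """Equations 6-8: A_uv = sum_i lambda_i y_i x_iu x_iv, then A = sum_i alpha_i z_i z_i^T.
    Returns (eigenvalues, eigenvectors-as-rows): at most n pairs instead of m support vectors."""
    A = (support_vectors * lambda_y[:, None]).T @ support_vectors
    eigenvalues, eigenvectors = np.linalg.eigh(A)       # A is symmetric
    return eigenvalues, eigenvectors.T

def quadratic_decision(X, vectors, coefficients, b):
    """f(X) = sum_i c_i * (V_i . X)^2 + b; works for the original support set
    (vectors = X_i, coefficients = lambda_i y_i) and for the reduced eigen set."""
    return np.sum(coefficients * (vectors @ X) ** 2) + b

Evaluating quadratic_decision() with the original support set and with the reduced eigen set gives the same value for any input X, but the reduced form requires at most n terms rather than m.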

Clustering Positive Vehicle Indications

After a frame is scanned, some locations within the image are suspected of being cars.
However, there are many more suspected vehicle locations than there are vehicles in the
image. These suspected vehicle locations must be brought together in clusters, such that
there is, ideally, only one measurement for each vehicle.

The clustering method employed in this application uses a Delaunay [25] triangulation.
The locations within the image which are suspected cars are triangulated as shown in Figure 12. After triangulation, triangles which have a large area or are long and thin are discarded, and small, well-behaved triangles are kept as vehicle locations. The centroids of the resulting polygons are considered to be the final measurement of the vehicle’s location. The area of the resulting measurement polygon is considered to be the measurement’s confidence because actual vehicles tend to produce many vehicle indicators, while false positives tend to be isolated.

Figure 12 Delaunay triangulation clustering
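A sketch of this clustering step using SciPy's Delaunay triangulation is shown below. The area and edge-length thresholds are illustrative stand-ins for the actual "large" and "long and thin" tests, the merged triangle area approximates the polygon area used as confidence, and the grouping of surviving triangles is done with a small union-find over shared vertices.

import numpy as np
from scipy.spatial import Delaunay

def cluster_detections(points, max_area=400.0, max_edge=40.0):
    """Triangulate suspected vehicle locations, discard large or long-thin triangles,
    merge the survivors that share vertices, and return (centroid, confidence) pairs."""
    points = np.asarray(points, dtype=float)
    tri = Delaunay(points)
    parent = list(range(len(points)))                 # union-find over detection points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kept = []
    for simplex in tri.simplices:
        a, b, c = points[simplex]
        area = 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
        edges = (np.linalg.norm(b - a), np.linalg.norm(c - b), np.linalg.norm(a - c))
        if area <= max_area and max(edges) <= max_edge:   # keep small, well-behaved triangles
            kept.append((simplex, area))
            for v in simplex[1:]:
                parent[find(v)] = find(simplex[0])

    clusters = {}                                      # root -> (member vertices, summed area)
    for simplex, area in kept:
        root = find(simplex[0])
        members, total = clusters.get(root, (set(), 0.0))
        clusters[root] = (members | {int(v) for v in simplex}, total + area)

    return [(points[sorted(m)].mean(axis=0), conf) for m, conf in clusters.values()]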

Tracking Vehicles

Associating the measurements of vehicle location between frame updates is commonly known as “tracking”. For example, many tracking problems associate radar blips between antenna sweeps. The problem of associating vehicle measurements between frame updates is a similar one.

In this case, a Kalman filter [26] is initialized for each untracked measurement which
exceeds confidence thresholds. This Kalman filter is updated with measurements as long
as a measurement appears within some distance of the Kalman filter’s current state. This
track-to-measurement association window varies in size depending on the confidence in
the Kalman filter’s internal states. A Kalman filter instance which has not received an
update for several frames has a larger track-to-measurement association window than a
Kalman filter which received an update in the previous frame.
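A minimal constant-velocity Kalman filter of the kind maintained per track is sketched below; the state is (x, y, vx, vy), the measurement is an (x, y) position, and the noise covariances, initial uncertainty and gate parameters are placeholder values rather than the ones tuned for this tracker.

import numpy as np

class VehicleTrack:
    """One vehicle track: a constant-velocity Kalman filter plus a growing gate."""

    def __init__(self, x, y, dt=1.0, q=1.0, r=4.0):
        self.x = np.array([x, y, 0.0, 0.0])            # state estimate: position and velocity
        self.P = np.diag([r, r, 100.0, 100.0])         # state covariance
        self.F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                           [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
        self.Q = q * np.eye(4)                         # process noise
        self.R = r * np.eye(2)                         # measurement noise
        self.frames_since_update = 0

    def predict(self):
        """Propagate the state to the next frame using the system model."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        self.frames_since_update += 1
        return self.x[:2]

    def update(self, z):
        """Fold a measured (x, y) position into the state estimate."""
        innovation = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)       # Kalman gain
        self.x = self.x + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P
        self.frames_since_update = 0

    def gate_radius(self, base=20.0, growth=10.0):
        """Association window grows while the track coasts without measurements."""
        return base + growth * self.frames_since_update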

Track-to-Measurement Association

For each frame, there is a set of measurements which must be associated with a set of
existing tracks. While some methods attempt to use probabilistic methods of associating
measurements with tracks [20], this application uses a simpler method which requires that each track either be associated with a single measurement, or not associated with a
measurement at all. The association algorithm first tries to group tracks with high-
confidence measurements. If a track is near one of the high confidence measurements, it
is updated with that measurement. The remaining measurements are then placed in
confidence groups and applied to the remaining, unassociated tracks until there are no
further measurements. This ordering of measurements gives a higher priority to high-
confidence measurements, and it seeks to avoid a situation where a track is updated with
the closest measurement without regard to the measurement’s confidence.

Figure 13 Screen shot of the tracking application
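A sketch of this association scheme, reusing the VehicleTrack sketch above: the confidence grouping is approximated by offering measurements to the free tracks in order of decreasing confidence, and each measurement may claim at most one track inside its gate.

import numpy as np

def associate(tracks, measurements):
    """Greedy one-to-one track-to-measurement association.
    measurements: list of (position, confidence) pairs. Returns {track index: position}."""
    free = set(range(len(tracks)))
    assignments = {}
    for position, confidence in sorted(measurements, key=lambda m: -m[1]):
        position = np.asarray(position, dtype=float)
        best, best_dist = None, np.inf
        for i in free:
            d = np.linalg.norm(position - tracks[i].x[:2])   # distance to the track's state
            if d < tracks[i].gate_radius() and d < best_dist:
                best, best_dist = i, d
        if best is not None:
            assignments[best] = position
            free.discard(best)
    return assignments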

Tracks which have not been updated by an actual measurement use their internal states
and the Kalman filter’s system model to find their location in the next frame.

Track Criteria

If a track does not meet certain geometric criteria, it is judged to have not come from a
vehicle, and it is dropped. Specifically, tracks must have a minimum length and a
minimum extent. Length is the linear distance between the start and end. Extent is the
size of the smallest rectangle which can be drawn around the track. If a track lies outside
the preset geometric parameters, it is rejected.
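These checks reduce to a few lines; the thresholds are placeholders, and the extent below uses an axis-aligned bounding box as a stand-in for the smallest enclosing rectangle.

import numpy as np

def is_valid_track(history, min_length=30.0, min_extent=20.0):
    """Reject tracks that never moved far enough to have come from a vehicle.
    history: the track's (N, 2) position history."""
    history = np.asarray(history, dtype=float)
    length = np.linalg.norm(history[-1] - history[0])            # start-to-end distance
    extent = (history.max(axis=0) - history.min(axis=0)).max()   # longest bounding-box side
    return length >= min_length and extent >= min_extent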

World Coordinates

Measurements taken from an image are not terribly useful unless there is a method of
associating a location in world coordinates with each pixel location in the image. It is
necessary to find a function which can transform an image location in pixels to a world
location in feet or meters... regardless of lens distortions. This process of finding an
image-to-world transformation function is often referred to as “calibrating” a camera, and
the image-to-world function and its parameters are often called a “camera calibration”.

Camera Calibration

There are two general types of camera calibration parameters, extrinsic and intrinsic.
Intrinsic calibration parameters are parameters which describe the distortions intrinsic to
the camera, while extrinsic parameters describe the orientation of the camera with respect
to the world coordinates. A commonly used method for finding intrinsic camera
parameters is described in [27]. This camera calibration method is encapsulated in an
easy-to-use Matlab toolbox [28] which requires the user to take images of a checkerboard
pattern held in front of the camera at various angles. The Matlab script then calculates
the intrinsic camera parameters.

After the intrinsic camera parameters are found, it is necessary to find the extrinsic
parameters of the camera. While it is assumed that the location of the camera has been
surveyed by GPS, the orientation of the camera is not immediately known and must be
calculated by knowing the correspondence between world and image coordinates for
three points in the image. Since the intrinsic camera parameters are already known,
accurate vectors in a camera-centered coordinate system may be found for each of the
three surveyed image locations under consideration. If the world locations for each of the
three image locations have been surveyed by GPS, a linear transformation R may be
found such that:

$$V_w^i = R V_c^i$$

where $V_w^i$ represents the ith unit vector which extends from the camera’s focal point towards the surveyed object in world coordinates, and $V_c^i$ represents the very same vector expressed in camera coordinates. R is a 3x3 matrix which represents the rotation between the world coordinate system and the camera coordinate system. Suppose that the three world coordinate vectors form a matrix such that:

$$V_w = \begin{bmatrix} V_w^1 & V_w^2 & V_w^3 \end{bmatrix}$$

and the three camera coordinate vectors form a matrix such that:

$$V_c = \begin{bmatrix} V_c^1 & V_c^2 & V_c^3 \end{bmatrix}$$

then:

$$V_w = R V_c$$

and if the vectors contained in the matrices $V_w$ and $V_c$ are not coplanar, R may be found by:

$$R = V_w V_c^{-1}$$

The rotation matrix R should be approximately orthonormal. Indeed, a reasonable sanity check on the result is to verify that the dot products of the matrix columns are close to zero and that the norms of the columns are close to unity. There are several reasons why R may not be orthonormal; a non-exhaustive list follows:

• Camera calibration intrinsic parameters could be wrong


• The three surveyed vectors may have been chosen such that they are too close to
being co-planar.

• The surveyed vectors were not normalized

If greater accuracy is desired, more than three surveyed points may be used. Since
Vw = RVc is overdetermined for more than three vectors, least-squares or some other
minimization method may be employed.
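A sketch of recovering R and running the sanity check; Vw and Vc are assumed to be 3xN matrices whose columns are matching unit vectors in world and camera coordinates, and the least-squares branch handles the overdetermined case with more than three correspondences.

import numpy as np

def rotation_from_correspondences(Vw, Vc):
    """Solve Vw = R Vc: exactly with three non-coplanar columns, by least squares otherwise."""
    Vw, Vc = np.asarray(Vw, dtype=float), np.asarray(Vc, dtype=float)
    if Vc.shape[1] == 3:
        return Vw @ np.linalg.inv(Vc)                  # R = Vw Vc^-1
    # R Vc = Vw is equivalent to Vc^T R^T = Vw^T, a standard least-squares problem
    return np.linalg.lstsq(Vc.T, Vw.T, rcond=None)[0].T

def looks_orthonormal(R, tol=0.05):
    """Sanity check: columns should have unit norm and be mutually orthogonal."""
    unit_norms = np.allclose(np.linalg.norm(R, axis=0), 1.0, atol=tol)
    dots = [abs(np.dot(R[:, i], R[:, j])) for i in range(3) for j in range(i + 1, 3)]
    return unit_norms and max(dots) < tol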

Using the Camera Calibration

The camera calibration parameters from [28] consist of:

• principal point in x
• principal point in y
• focal length x
• focal length y
• four radial distortion coefficients

C code for using the above parameters may be found in Intel’s OpenCV library. The
OpenCV function “icvNormalizeImagePoints” will use the above parameters to convert
an image point $(x_i, y_i)$ to a normalized image point $(x_c, y_c)$. In this case “normalized” means that the original image point $(x_i, y_i)$ has been transformed as if the original image had been captured with an imaginary, ideal pinhole camera of unit focal length. “icvNormalizeImagePoints” will return a vector $V_c$ for image point $(x_i, y_i)$. $V_c = (x_c, y_c, 1)$ is a vector which points from the camera’s focal point to the object in the real world which had been represented in the original image by the pixel location $(x_i, y_i)$.
If we transform Vc such that it is aligned with the standard world coordinate axes, we
have:

Vw = RVc

where the rotation matrix R is found as described above. The world coordinate representation of the original image point $(x_i, y_i)$ may then be found by intersecting $V_w$ with the ground plane.
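Putting the calibration pieces together, the sketch below uses OpenCV's cv2.undistortPoints (a public equivalent of the internal icvNormalizeImagePoints routine mentioned above) to normalize a pixel, rotates the resulting ray into world coordinates, and intersects it with a flat ground plane; the z-up convention, the ground plane at z = 0 and the surveyed camera position argument are assumptions of this sketch.

import numpy as np
import cv2

def image_point_to_world(u, v, camera_matrix, dist_coeffs, R, camera_position):
    """Map pixel (u, v) to world coordinates on the ground plane z = 0.
    camera_matrix / dist_coeffs: intrinsic calibration; R: camera-to-world rotation;
    camera_position: surveyed camera location in world coordinates (z is its height)."""
    # Normalize: as if imaged by an ideal pinhole camera with unit focal length.
    xc, yc = cv2.undistortPoints(np.array([[[u, v]]], dtype=np.float64),
                                 camera_matrix, dist_coeffs)[0, 0]
    Vc = np.array([xc, yc, 1.0])                 # ray from the focal point, camera coordinates
    Vw = R @ Vc                                  # the same ray in world coordinates
    t = -camera_position[2] / Vw[2]              # scale the ray down to the ground plane
    return camera_position + t * Vw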

Limitations of the System

One of the primary limitations of the tracking algorithm is the pattern recognition. The
pattern classifier has been primarily trained on the data collected during the November/December 2001 test at the intersection of Refugee Road and Winchester Pike in Columbus, Ohio. The degree to which the vehicle classification depends on the specific backgrounds and lighting conditions encountered during this period of operation is unknown.

Figure 13 Digital photograph classifier results from an Ohio State U. parking garage

The 80x80 image segments, which are fed into the classifier, are large enough to
encompass large amounts of background. This large classification window is necessary
because the vehicles often do not appear with significant amounts of detail (dark vehicle
casting a long shadow, for instance). In these situations recognizing a vehicle is difficult,
even for a human, without significant amounts of context on either side of the vehicle.
This means that the recognition is very background dependent. For instance, a dark blob
encountered within lane markers might be classified as a vehicle while a dark blob
encountered elsewhere is not. However, a more robust system might rely more on
internal vehicle details (windows, wheels) than on the more general vehicle body.

Large trucks remain a problem. Some large trucks are classified correctly; a light colored truck cab with dark windows will generally result in a correct classification as shown in Figure 14. However, in general, trucks are not classified correctly. Also, a large semi-
tractor trailer may obscure several cars from the camera’s view—in these cases, the track
is almost always lost or led astray.

All truck detections are generalizations from cars. Trucks were not included in the training set, in either the positive or negative classes. Thus, large trucks and their trailers may produce many false positives. If these false positives are persistent enough, trucks may cause spurious tracks which may need to be deleted later. At some point, a classifier designed specifically for trucks may be needed.

Figure 14 Tractor trailer recognition

Recognition Limitations

Although the pattern recognition-based traffic tracking implemented here is more robust
than techniques such as background estimation, the pattern recognition can be easily
misled. Different lighting conditions may still cause objects within a frame of video to
have an appearance very different from any of the training data. While the goal of the
pattern recognition is to generalize from the training set such that never-before-observed
vehicles are recognized as vehicles, lighting and shadows may still cause a recognition
error.

Figure 15 Dazzle paint camouflage.

Figure 15 shows a ship painted in a type of wartime camouflage which attempts to
disrupt recognition of the object, rather than trying to blend in with the background.
Normally, the human eye could detect the above ship many miles away, and both the type
and direction of the vessel could be easily discernable. However, the dazzle paint
camouflage disrupts the lines which normally encode information about the type and
direction of the vessel. The success and widespread use of such camouflage prior to the
advent of radar shows that in some cases, even human pattern recognition is not
necessarily robust.

Figure 16 shows a car tracking situation where the false positives caused by shadows are
a significant problem. The top image of Figure 16 shows regions of the image which are
recognized as cars, while the bottom image is identical but shows the same image with
the current tracks overlaid. The shadow cast by the upper portion of the lower right hand
utility pole is interpreted strongly and wrongly as a car. This mistaken measurement is
labelled “1” in the upper image while the corresponding track is labeled “2721” in the
lower image. In addition, the group of bushes next to the gas station in the middle of the
image is being tracked as a vehicle, and the bushes on the right side of the screen are
being tracked as a vehicle as well. The top of the lower right utility pole is generating a
false positive. Finally there are two false positive measurements in the upper left of the
image. Generally, classification performance seems to decline in the presence of strong
sunlight and shadows. Figure 17 shows the same scene under diffuse lighting conditions.
One can see that the incidence of false positives has declined.

One possible reason for the difference in classification performance between direct and
diffuse lighting is that there are fewer ways of lighting an object with diffuse lighting
than with direct lighting. In other words, the specification of direct lighting source
requires a direction and an intensity, while the specification of a diffuse lighting source
requires only an intensity. Thus, an object viewed under direct lighting has more
variation in its appearance than an object viewed under diffuse lighting. The shadows cast by other scene objects are another major problem with direct lighting. Some algorithms attempt to reduce the lighting variation by employing techniques such as brightness plane subtraction; however, no such algorithm is currently used in this application.

Figure 16 Tracking under sunny lighting conditions

Figure 17 Tracking under diffuse lighting conditions

Possible Improvements
There are three primary ways in which the system’s performance may be improved and
generalized. First, one could place recently discovered false negative images into the
positive training set. This would allow car images which were mistakenly identified as
“not-car” to be incorporated into the classifier’s training. Similarly, one could also place
new false positives into the negative training set as well. These additions to the training
images would allow the classifier to learn from its mistakes. Second, tracking could be
improved by using the probabilistic data association techniques discussed in [20].
Currently, the data association is “nearest neighbor” with some attempts to favor high-confidence measurements. Third, one could try to find a method of decreasing the
lighting variation before even applying the classifier. If there were less lighting variation,
the classifier’s task would not be as difficult.

Conclusion
This vehicle classifier and tracker provide a method of studying traffic flow which has
not previously existed. At the present time, there is no other means of finding multiple
vehicle trajectories in complex traffic situations. The inherent flexibility and extensibility
of the pattern classifier at the heart of the system speaks well for operation in varied
environments.

References
[1] M.S. Bartlett, H.M. Lades, and T.J. Sejnowski. Independent component representations for face recognition. In Proceedings of the SPIE Conference on Human Vision and Electronic Imaging III, volume 3299, 1998

[2] F. Bartolini, V. Capellini, and C. Giani. Motion estimation and tracking for urban
traffic monitoring, International Conference on Image Processing, volume 3, 1996

[3] C.J.C. Burges. Simplified support vector decision rules. In International Conference on Machine Learning, 1996

[4] R. Collobert and S. Bengio. SVMTorch: Support Vector Machines for Large-Scale
Regression Problems. Journal of Machine Learning Research, 1:143-160, 2001

[5] T. Downs, K.E. Gates, A. Masters. Exact Simplification of Support Vector Solutions,
In Journal of Machine Learning Research 2 2001

[6] D. Koller, J. Weber and J. Malik, Towards realtime visual based tracking in cluttered
traffic scenes, In: Proc. of the Intelligent Vehicles Symposium 1994, October 1994, Paris,
France

[7] E. Osuna, R. Freund, and F. Girosi. Training Support Vector Machines. Conference on Computer Vision and Pattern Recognition, Puerto Rico, 1997

[8] Y. Qi, D. DeMenthon, and D. Doermann. Hybrid Independent Component Analysis and Support Vector Machine Learning Scheme for Face Detection. International Conference on Acoustics, Speech and Signal Processing, May 2001, Salt Lake City, Utah

[9] H. Schneiderman and T. Kanade Object Detection Using the Statistics of Parts, In
International Journal of Computer Vision 2002

[10] R.A. Singer Estimating optimal tracking filter performance for manned
maneuvering targets, IEEE Tran. Aerospace Electronics Systems, 1971

[11] Z Sun, G. Bebis, and R. Miller, “Quantized wavelet features and support vector
machines for on-road vehicle detection,” The Seventh International Conference on
Control, Automation, Robotics and Vision, December, 2002, Singapore

[12] R.Y. Tsai. “A versatile camera calibration technique for high-accuracy 3D machine
vision metrology using off-the-shelf TV cameras and lenses,” IEEE Journal of Robotics
and Automation, Vol RA-3, No. 4, August 1987, pages 323-344

[13] M. Turk and A. Pentland. Eigen Faces for Recognition. Journal of Cognitive
Neuroscience, 3(1), 1991.

[14] H. Veeraraghaven, O. Masoud, and N. Papanikolopoulos. Managing Suburban Intersections Through Sensing. Technical Report, Intelligent Transportation Systems Institute, University of Minnesota, December 2002

[15] F. Rosenblatt “The Perceptron: A Probabilistic Model for Information Storage and
Organization in the Brain,” Cornell Aeronautical Laboratory, Psychological Review
1958, v65, No. 6

[16] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, 2000

[17] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York 1995

[18] A. Mohan, C. Papageorgiou, and T. Poggio. “Example-Based Object Detection in Images by Components”. IEEE Trans. on Pattern Analysis and Machine Intelligence, April 2001

[19] C. Rasmussen and G.D. Hager “Probabilistic Data Association Methods for Tracking
Complex Visual Objects” IEEE Trans. on Pattern Analysis and Machine Intelligence,
June 2001

[20] Y. Bar-Shalom and T. Fortmann. Tracking and Data Association. Academic Press,
1988

[21] A. W. Fitzgibbon and R.B. Fisher. “A Buyer’s Guide to Conic Fitting”. Proc. 5th
British Machine Vision Conference, Birmingham 1995

[22] J. Canny. “A Computational Approach to Edge Detection”. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(6), 1986

[23] P. Viola and M. Jones, Robust Real-Time Object Detection. International Journal of
Computer Vision, 2002

[24] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997

[25] Leonidas J. Guibas and Jorge Stolfi. Primitives for the Manipulation of General
Subdivisions and the Computation of Voronoï Diagrams. ACM Transactions on Graphics
4(2):74-123, April 1985.

[26] R.E. Kalman A New Approach to Linear Filtering and Prediction Problems.
Transactions of the ASME--Journal of Basic Engineering, 1960 vol 82 pp 35-45

[27] Z. Zhang. “A Flexible New Technique for Camera Calibration.”, IEEE Trans. on
Pattern Analysis and Machine Intelligence, 22(11):1330-1334, 2000.

[28] Jean-Yves Bouguet, Camera Calibration Toolbox for Matlab, April 2002. Available:
http://www.vision.caltech.edu/bouguetj/calib_doc

Appendix A
$$f(X) = \sum_{i=1}^{m} \lambda_i y_i (X_i^T X)^2 + b$$

$$f(X) = \sum_{i=1}^{m} \lambda_i y_i \left[ (x_{iu} x_{iv})_{(u,v)=(1,1)}^{(n,n)} \right] \cdot \left[ (x_u x_v)_{(u,v)=(1,1)}^{(n,n)} \right] + b$$

$$f(X) = \left[ \sum_{i=1}^{m} \lambda_i y_i (x_{iu} x_{iv})_{(u,v)=(1,1)}^{(n,n)} \right] \cdot \left[ (x_u x_v)_{(u,v)=(1,1)}^{(n,n)} \right] + b$$

$$f(X) = X^T A X + b, \qquad A_{uv} = \sum_{i=1}^{m} \lambda_i y_i x_{iu} x_{iv}$$

