
The Automated Tracking Of Vehicles and Pedestrians In CCTV For Use In The Detection Of Novel Behaviour

James Anthony Humphreys

Project Supervisors: A. Hunter, N. Holliman

MSc by Research

Durham University, Department of Computer Science

Abstract

This thesis describes work on the automated detection of suspicious pedestrian activity in outdoor CCTV surveillance footage, and in particular the development of a robust pedestrian tracker. Areas of movement are detected using adaptive background differencing. These detected areas of movement are referred to as silhouettes. Silhouettes having an area larger than a given constant are instantiated as objects. Each object is then classified as a car or a pedestrian by inputting several key features, such as size and aspect ratio, into a multi-layer perceptron neural network. To track effectively, the algorithm must match silhouettes found in the current frame to the objects of the previous frame. It does this by examining the cost of matching a silhouette-object pair based on simple features such as area, position and a histogram of pixel intensities. The cost is estimated using several self-organising maps to assess how ‘novel’ a matching is compared to a hand-marked reference standard. The algorithm searches through a space of possible object-silhouette matchings to find those which yield the lowest global cost. The object positions and features are then updated using the silhouettes to which they are matched. This process is repeated at every frame to produce continuous tracking. The system explicitly deals with merging, where the silhouettes of two objects merge into a single silhouette, and fragmentation, where a single silhouette splits into several silhouettes and must be reconstituted.

Contents

1 Introduction
1.1 The growth of CCTV surveillance
1.2 Automated CCTV surveillance
1.3 Context and Objectives
1.4 Achievements
1.5 Structure Of This Thesis

2 Object Segmentation And Tracking
2.1 Tracking Applications
2.2 Tracking System Architecture
2.3 Object Segmentation
2.3.1 Overview
2.3.2 Background Differencing and Optic Flow
2.4 Object Matching and Tracking

3 Tracking Algorithm
3.1 Overview
3.2 Background Differencing
3.2.1 Research and Testing
3.2.2 Final Design
3.2.3 High-level feedback
3.3 Object Matching
3.3.1 Overview
3.3.2 Cost Function
3.3.3 Object Merging & Partitioning
3.3.4 Conflict Resolution
3.3.5 Search Function
3.4 Object Classification
3.5 Chapter Summary

4 Results And Comparisons
4.1 Developing a reference standard
4.1.1 Overview
4.1.2 Potential Deficiencies
4.2 Statistical Analysis
4.2.1 Individual SOMs
4.2.2 Overall performance

5 Conclusion
5.1 Objectives and Achievements
5.2 Further Work

List of Tables

3.1 The 2-phase training of the SOMs. The first phase learns the coarse structure, whilst the second fine-tunes the individual neurons
3.2 A tabular summary of the 3 SOMs and their inputs
3.3 Confusion matrix of the basic area-based Bayesian classifier
3.4 The confusion matrix for the MLP-based classifier, with a single hidden layer of 4 units, on a test set of 10517 cases

4.1 Table of all 4 possible tracker/reference standard match combinations

List of Figures

2.1 A typical surveillance pipeline
2.2 The recursive two-step predict-correct Kalman cycle

3.1 An overview of the tracking algorithm
3.2 Example results from a Lucas Kanade optic flow algorithm – 2 successive frames, and the results from Lucas Kanade and block matching, respectively
3.3 Example results from the iterative pyramidal Lucas Kanade approach – points chosen in the first image are tracked to their equivalent points in the second image
3.4 Two comparative results of pedestrian segmentation. On the left is the original image, the middle is background differencing and the right image is the MoG result
3.5 Comparison of performance of background differencing (left) vs MoG (right) with camera judder
3.6 An overview of the background differencing algorithm
3.7 Background differencing followed by morphological noise reduction
3.8 An example of a challenging object-silhouette matching with merging and fragmentation
3.9 A breakdown of the object tracker algorithm
3.10 A histogram of the costs of reference (‘correct’) matches, to aid the choice of value of the γ cost variable
3.11 The effect of different values of γ on the number of ‘extra’ and ‘lost’ objects
3.12 The PCA object partitioning algorithm
3.13 Projecting a pixel onto the PCA line using the dot product
3.14 The PCA-based partitioning of two merged pedestrians. The merged silhouette is taller than it is wide, causing partitioning to fail.
3.15 Results from a scaled PCA partition on real data
3.16 Results from a scaled PCA partition on artificially merged silhouettes
3.17 Angle-search method – (a) 30 different angles are assessed, and (b) the best angle is chosen
3.18 Graph plotting the total cost as different angles were tested. A very clear dip is visible in the centre of the graph where the true best result lies
3.19 A sample of four partitions tested to assess the quality of the search algorithm. (a) Partitioning 2 cars, (b)(c) 2 pedestrians, (d) 2 pedestrians and a car
3.20 An illustration of two poor partitionings, taken one second apart
3.21 A candidate match matrix, Mc, illustrated as a bipartite graph
3.22 An overview of the steps taken by the conflict resolution module
3.23 A valid match matrix, V, illustrated as a bipartite graph. The search space can be divided into 2 along the dotted line.
3.24 The likelihood of an object being a pedestrian or vehicle, given its area
3.25 The final Multilayer Perceptron design, with a single hidden layer of 4 units, for classifying an object as pedestrian or vehicle on the basis of its basic features
3.26 A pedestrian misclassified as a car, due to its being segmented together with an open car door
3.27 A car misclassified as a pedestrian, due to the front end of the car being outside the shot of the camera

4.1 The main tracker window, asking the user whether to incorporate an object into the background. Three input boxes allow the user to input frame numbers where this should occur early.
4.2 An example of a reference standard difference image. In this case, frame 307 of the sequence known as ‘seq003’
4.3 The reference standard creation program
4.4 A poorly segmented pedestrian. Whichever match matrix is chosen for the reference standard, the result will always be unsatisfactory
4.5 Performance of the motion SOM, given a specific point and speed but different angles of motion.
4.6 Performance of the comparative SOM, testing the effect of changing pArea on the output cost. Vehicle A is in the centre of the scene, about to park. Vehicle B is just entering the scene.
4.7 Performance of the appearance SOM, testing the effect of changing aspect ratio on the output cost. Pedestrian A has just exited his/her vehicle. Pedestrian B is in an unobstructed area of the scene.
4.8 The two pedestrians used to capture the data for figure 4.7
4.9 The percentage of object matches within a distance X of the reference standard, using the four combinations of cost and search function modules in the test set
4.10 The average match distances, comparing the Owens/SOM cost functions, and exhaustive/greedy search modules in the test set.
4.11 The percentage of object matches within X flips of the reference standard match matrix, using the four combinations of cost and search function modules
4.12 The number of ‘lost’ and ‘extra’ objects through the test sequence, using the four combinations of cost and search function modules
4.13 The number of orphan objects through the test sequence, using the four combinations of cost and search function modules
4.14 The percentage of object matches within a distance X of the reference standard, using the four combinations of cost and search function modules in the selection set.
4.15 The average match distances, comparing the Owens/SOM cost functions, and exhaustive/greedy search modules in the selection set.
4.16 The number of lost/extra/orphan objects in the selection set, using the four combinations of cost and search function modules.
4.17 A pedestrian emerging from partial occlusion is fragmented, yielding a new orphan object created by the silhouette of the legs. The video image with MBRs is on the left, with the difference image on the right.
4.18 Four frames after the example illustrated in figure 4.17, the tracker has corrected the original segmentation error.
4.19 A couple of pedestrians tracked as a group.
4.20 An ‘actor’ walks around the car park suspiciously without any apparent ill effects on the quality of the tracking.


Declaration

I hereby declare that the dissertation, submitted in total fulfilment of the requirements for the degree of Master of Science and entitled ‘The Automated Tracking Of Vehicles and Pedestrians In CCTV For Use In The Detection Of Novel Behaviour’, represents my own work and has not been previously submitted to this or any other institution for any degree, diploma or other qualification.

Statement of Copyright

The copyright of this thesis rests with the author. No quotation from it
should be published without their prior written consent and information
derived from it should be acknowledged.


Acknowledgements

Many thanks to Andrew Hunter whose expert guidance and continuing support enabled me to complete this work.

Chapter 1

Introduction


1.1 The growth of CCTV surveillance

In the past decade, the use of CCTV surveillance has grown enormously. It is
now estimated that there are over 4 million cameras scattered throughout the
UK. Those living in London are likely to be caught on CCTV camera 300
times a day. The UK is now often said to be the most surveillance-oriented
country in the world, having the highest ratio of cameras to people anywhere.
These cameras cover both public areas, such as city centres, car
parks and areas that are prone to crime, and private areas, such as shops and
nightclubs. The proliferation of CCTV surveillance is due partly to the falling
costs of the hardware and partly to the sentiment on the part of the public
that there is a greater need for security on the streets. The effects of CCTV
on crime figures are debatable, as found in a report by Phillips[21], which
provides examples of strong reductions in crime, mixed results and negligible
effects. However, in a study of the effects of CCTV in town centres, Brown[5]
found that areas of Newcastle in which CCTV had been installed reported

a drop of 56% in monthly burglary figures – with a similar pattern in the
number of criminal damage incidents. In Taunton, where CCTV equipment
was installed in car parks, motor vehicle theft fell by over 50%[19]. In a
case by case analysis on the effect of CCTV in car parks, Tilley [23] found
‘quite strong evidence’ that it led to reductions in various categories of car
crime. A number of high-profile cases where television appeals have led to
the capture of criminals have bolstered CCTV’s reputation as an effective
crime fighting tool. Clearly, all this increased surveillance comes at a cost.
The hardware required, in terms of cameras, relay equipment, video storage
and radio equipment, accounts for only a fraction of the cost when one takes
into account the need to employ an operator. Whilst some cameras only
record information for potential later use by police, many require constant
monitoring by an operator so that the police or other authorities can be
contacted in a timely manner in the event of a crime.
One operator can be responsible for several cameras at once. The max-
imum number of cameras a single operator can effectively monitor varies
widely depending on the source of the research[6]. Although research by the
Police Scientific Development Branch (PSDB)[25] and other organisations
was unable to place concrete bounds on this number, it found that CCTV
operators generally believe that the maximum number of camera views they
can effectively monitor is 16 or less, with over half of operators placing the
maximum number between one and four scenes. Research suggests that as
the number of scenes monitored and their complexity increases, the likeli-
hood that suspicious behaviour will be picked up by the operator decreases.
This is in line with earlier research by Tickner and Poulton (cited within the
PSDB research document) comparing the accuracies obtained when 4, 9 and
16 monitors were monitored, leading to performances of 83%, 84% and 64%

respectively. Similarly, it should be noted that quiescent scenes can cause
drops in operator accuracy due to the loss of attention on the part of the
operator.

One possible reason that it is so difficult to place an upper bound on
the number of cameras per operator is that it is highly dependent on issues
such as the complexity of scenes being viewed, the number of scenes per
monitor, monitor size, the required rate of detection and the competence
of the operator. The upper bound is extremely dependent on the specific
situation in which the CCTV is being used, and is therefore generally left to
the discretion of the system manager.
Research that tries to place figures on the period of time that an operator
can safely monitor scenes suffers from a similar problem. Again, due to the
specificity to the situation in which the system is used, this is left to the
discretion of the system manager. Figures seem to range from 30 minutes to
2 hours, with breaks of 5 to 15 minutes (from PSDB guidelines[25]).
This research demonstrates that the manpower required to view CCTV
cameras is considerable. As the number of cameras in use increases, so too
must the number of trained operators and their consequent cost.

1.2 Automated CCTV surveillance

To ease the burden on CCTV operators a number of automated CCTV sup-
port technologies have been, and are still being, developed. Video Motion
Detection Systems monitor cameras for signs of activity so that, for example,
the relevant camera can be highlighted to attract the attention of the operator.
Offline recording systems can save space by only recording those scenes
in which the motion detector has detected movement. As a refinement, the

areas showing motion can be highlighted and a bounding box placed around
the offending area so that the operator can immediately zero in on the ap-
propriate zone. The technique is of course only useful in relatively quiescent
scenes since placing a box around all pedestrians in a crowded city centre
would clearly add to the confusion of the scene rather than alleviate it. The
system is most useful when dealing with extremely quiet scenes in which the
operator’s attention might drift. Unfortunately, it tells us very little other
than that there is visible motion – it tells us nothing about how suspicious
the behaviour is. Modern tracking systems attempt to learn patterns of
behaviour of pedestrians in order to differentiate between usual and
unusual patterns of activity. By detecting unusual patterns, these systems
aim to detect those situations which are most likely to require the operator’s
attention. To learn and detect these patterns of activity, tracking systems
often rely heavily on a robust real-time object tracker capable of reliably
differentiating between vehicles and pedestrians. The object tracker is one of
the most important parts of the system and yet also the most prone to error.
This project addresses these problems by creating a robust tracking system
with object classification and novelty detection firmly in mind.

1.3 Context and Objectives

This thesis owes many of its basic concepts to a body of work produced
by Jonathan Owens[20]. Owens introduces an intelligent surveillance sys-
tem, capable of tracking vehicles and pedestrians using adaptive background
differencing. In Owens’ system, foreground areas are grouped together us-
ing connected components to produce contiguous areas known as silhouettes.
A feature matching algorithm (based on the Owens cost function, presented

later) establishes temporal correspondence between segmented objects to pro-
vide continuous tracking of the objects’ centroids as they traverse the scene.
The system is designed to act as an attention-focussing filter for an operator.
Activity which the system considers to be suspicious is highlighted onscreen
to draw the attention of the operator. In this way, a single operator is safely
able to view a larger number of screens – or the same number of screens with
greater accuracy. The behaviour classification is based purely on the tracking
of the object centroids and is partitioned into local and global motion. A
SOM-based neural network is used to classify local object motion based on
instantaneous motion and a recent window of movement. A hierarchical self-
organising neural network learns global activity patterns, in order to classify
paths as either normal (non-suspicious) or novel (suspicious). Clearly, the
accuracy of the behaviour classification is heavily dependent on the quality
of the tracking system, and it was noted that in this area the Owens surveil-
lance system could be significantly improved. Broadly, the objective of this
body of work is to improve the performance of that tracker, drawing on the
original Owens tracker for inspiration.
More precisely, the objectives of this project are as follows:

• To create a robust real-time pedestrian and car tracker capable of track-

ing several objects simultaneously on a static CCTV scene and reliably
classifying objects as cars or pedestrians.

• The tracker must run in real-time on standard desktop hardware and

use standard CCTV greyscale equipment.

• The system must be capable of being extended to provide novelty detection,
as detailed in work by Jonathan Owens and Andrew Hunter.

• To qualitatively and quantitatively analyse the performance of the de-
veloped tracking system and, where possible, to justify design decisions
using statistical data.

1.4 Achievements
A robust real-time tracker capable of tracking and differentiating between
vehicles and pedestrians is presented herein. The system runs at 4Hz on
standard CCTV footage with a greyscale image of 640x480 resolution. The
modular architecture of the system allows independent analyses and improve-
ments to be made on each of the system’s modules. Around 4.5 hours of
video with hand-marked object positions create a ‘reference standard’ against
which the tracker’s performance can be verified. The reference standard also
provides the means to train the tracker’s novel Self Organising Map (SOM)
based cost functions (as detailed later), leading to an improvement in perfor-
mance over more traditional methods. Analyses of each of the modules and
the system as a whole provide statistical confirmation that the developed
tracker is indeed robust.

1.5 Structure Of This Thesis
Chapter 2 begins with a survey of the literature surrounding the field of object
tracking and automated surveillance. Chapter 3 introduces the design of the
tracker, beginning with an overview of the architecture and then examining
each module of the system in turn. A description of the development of the
reference standard and the subsequent statistical analyses of performance
follow in Chapter 4. Chapter 5 provides a conclusion, and discusses potential
further work.

Chapter 2

Object Segmentation And Tracking
With a view to achieving the objectives outlined previously, a review of rel-

evant specific fields and of the work carried out by others pursuing similar
goals has been carried out. This chapter will begin by briefly examining
concepts that are central to most forms of tracking, and move on to discuss
cuss previous work and research, broken into the two broad areas of object
segmentation and temporal object matching.

2.1 Tracking Applications

Interest in tracking and intelligent surveillance has grown a great deal in the
past few years, probably thanks to recent research successes, an increase in
the power of hardware and its wide field of applicability. Visual tracking can
be put to an enormous range of uses. Pfinder, developed by Wren et al.[27],
is able to track a person and individual body parts. Using this as a basis,
it has been put to uses such as gesture recognition systems including one

that is able to recognise a forty-word subset of the American Sign Lan-
guage with great accuracy. It has also been used to track body movements
to control video games and to immerse people into distributed virtual reality
environments populated by artificial life (such as the ALIVE space). Similar
systems have also been applied to athletic performance, where visualisation
and analysis of body movements can lead to improvements in technique. Vi-
sual driving assistance systems have been developed using visual tracking
and are designed for applications such as lane departure warnings, lane keep-
ing, collision warning or avoidance, adaptive cruise control and low speed
automation in congested traffic. These vision systems are sometimes also
dependent on other sensors such as RADAR and LIDAR (Light Detection And
Ranging) for accurate distance measurements, but rely solely on the camera
data for semantic information. Some techniques, such as that developed by
Zhao et al.[28], rely exclusively on data from non-visual systems to perform
tasks that are usually the preserve of purely visual systems. Pedestrians are
tracked in open areas such as malls using several laser-range scanners. The
laser range scanners are placed close to the floor and scan in the horizontal
plane to capture legs and features of the environment. Each laser range scan-
ner feeds its data to a client computer which segments moving objects using
background subtraction. The data from several clients is then collected by
a central server computer, which collates the data into a global coordinate
system and tracks people by the movements of their legs.
Returning to the traffic theme, ANPR (Automatic Number Plate Recog-
nition) is an established technology that, as its name implies, is able to
read number plates and is now widely used by police forces in the UK. The
system is mobile and can be mounted inside police vehicles. The system

autonomously reads number plates in transit and checks several databases
including the Police National Computer, the DVLA and police intelligence
records to assess whether the vehicle is wanted by police. If the vehicle is
of interest to the officers, the system alerts its users. Moving closer to the
area addressed by this thesis, multiple pedestrian tracking itself has many
uses other than intelligent surveillance and crime prevention. By tracking
pedestrians moving in city centres, town planners are able to model traffic
flows and are therefore better equipped to plan future improvements in in-
frastructure for pedestrians. In stores, it can be used to count pedestrians
entering and exiting and trace the paths that customers take around the
shop. This information is clearly extremely useful in market statistics and
in the design of store layouts. In short, there is an enormous number of
applications for tracking, and in particular the tracking of pedestrians and
vehicles. This huge variety in requirements has in turn led to a large varia-
tion in tracking methods. The following sections will provide an overview of
the methods used in the field of visual tracking systems, with a sharp focus
on the tracking of pedestrians and vehicles.

2.2 Tracking System Architecture

As discussed in the above survey of uses of tracking, a wide variety of appli-
cations of tracking – and therefore tracking techniques – exists. Whilst they
all vary in their methods, several central themes are common to most visual
systems. A typical processing pipeline is shown in figure 2.1.
Figure 2.1: A typical surveillance pipeline

After acquisition, the image is sometimes subjected to conditioning to
remove unwanted noise, for example with a median filter. Thanks to the
improved quality of video capture and transmission, however, the need for
this step has been greatly reduced, and it does not appear in the above figure.
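As a concrete illustration of this optional conditioning step, a median filter pass over a greyscale frame might look like the following sketch. This is not code from the thesis; the function name and parameter values are illustrative, and SciPy's median filter stands in for whatever smoothing a real pipeline would use.

```python
import numpy as np
from scipy.ndimage import median_filter

def condition_frame(frame: np.ndarray, size: int = 3) -> np.ndarray:
    """Suppress isolated (salt-and-pepper) noise pixels before segmentation.

    Each pixel is replaced by the median of its `size` x `size`
    neighbourhood, which removes single-pixel spikes while largely
    preserving object edges.
    """
    return median_filter(frame, size=size)

# A lone bright pixel in an otherwise dark frame is removed entirely,
# since the median of its neighbourhood is 0.
frame = np.zeros((5, 5), dtype=np.uint8)
frame[2, 2] = 255
cleaned = condition_frame(frame)
```

A 3x3 window is a common default; larger windows remove more noise at the cost of eroding fine detail such as pedestrians' limbs.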
The design of a tracking system is critically dependent on the form and
type of image data to be processed, and the requirements that are made on
it. For example, the input image may be in colour or greyscale and have
varying degrees of resolution. The viewing platform may be static, moving
independently (such as on an aircraft) or be on a pan/tilt stand so that the
view is remote controlled. In many examples of research reviewed, several
independent cameras were used in concert to create coordinated coverage of
a large area. Also to be taken into account are the lighting conditions of the
scene which are of course heavily influenced by whether the camera is indoors
or outdoors. The flexibility and complexity of the tracking algorithm is also
constrained by the available computing hardware, and it is often a question
of striking a balance between model complexity and speed.

2.3 Object Segmentation

Stepping back from the details of the image processing pipeline and exam-
ining it more broadly, the entire process can be visualised as a step by step
reduction of redundant information until only the most salient information re-

mains. The accuracy of the tracker is of course highly dependent on whether
the models in each step of the algorithm reflect the underlying structure of
the data, and can accurately distinguish what is salient from what is not. In
the case of this object tracker, the aim is to extract a history of centroids
of objects passing through a scene and their nature (i.e. car or pedestrian).
The object segmentation stage serves to focus the attention of the tracker only
on those areas that are in motion, helping to discard large amounts of the
image and to highlight areas of the image that belong to an object.

2.3.1 Overview

This section provides a survey of current techniques in object segmentation.

Two main conceptual techniques for segmentation are presented here. The
first is that of background differencing, which involves comparing the current
frame with some form of background (or reference) image. The result of this
is a binary image with only those pixels considered to be different from the
reference image having a value of ‘1’, representing foreground pixels. Sys-
tems vary greatly as to how this is implemented, and how the reference image
is implemented, as will be covered shortly. The second technique is known
as optical flow. This technique compares the current frame to the previous
frame, and attempts to establish a correspondence between small blocks in
the two frames in order to assess the motion within the image. The output
is a vector field – a matrix of vectors – indicating where the system believes
areas of the previous frame have moved to in the current frame. The resolution of this
matrix is often a great deal smaller than that of the input image. The fol-
lowing section begins by introducing the background differencing technique,
its commonly encountered problems and goes on to critically appraise re-
cent research works and their ability to overcome these difficulties. Optic

flow-based techniques are briefly covered at the end of this section.
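To make the idea of a block-wise correspondence concrete, the sketch below estimates a coarse vector field by block matching: for each block of the previous frame, it searches a small neighbourhood of the current frame for the displacement with the lowest sum of absolute differences (SAD). This is a minimal illustration under assumed parameters (8-pixel blocks, a ±4 pixel search window), not any specific algorithm from the literature reviewed here.

```python
import numpy as np

def block_match_flow(prev: np.ndarray, curr: np.ndarray,
                     block: int = 8, search: int = 4) -> np.ndarray:
    """Estimate one (dy, dx) motion vector per block by SAD block matching."""
    h, w = prev.shape
    rows, cols = h // block, w // block
    flow = np.zeros((rows, cols, 2), dtype=int)
    for by in range(rows):
        for bx in range(cols):
            y0, x0 = by * block, bx * block
            patch = prev[y0:y0 + block, x0:x0 + block].astype(int)
            best_sad, best_v = None, (0, 0)
            # Try every displacement within the search window.
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y1, x1 = y0 + dy, x0 + dx
                    if y1 < 0 or x1 < 0 or y1 + block > h or x1 + block > w:
                        continue  # candidate window falls outside the frame
                    cand = curr[y1:y1 + block, x1:x1 + block].astype(int)
                    sad = np.abs(patch - cand).sum()
                    if best_sad is None or sad < best_sad:
                        best_sad, best_v = sad, (dy, dx)
            flow[by, bx] = best_v
    return flow
```

Note that the output has one vector per block, illustrating why the resolution of the vector field is much lower than that of the input image; the exhaustive inner search is also why optic flow tends to be more expensive than background differencing.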

2.3.2 Background Differencing and Optic Flow

After the image is acquired, areas containing objects need to be identified.

One of the oldest and still most widely used techniques is that of background
differencing in its various incarnations. It is so widely used because of its
relative simplicity, effectiveness and (usually) low computational cost. In
its simplest form a reference image – an image of the background of the
scene – is taken of the empty scene before tracking begins. To find fore-
ground objects, a pixel-wise comparison between the reference image and
the current image is performed. If the difference in intensities is above a
certain threshold, the pixel is marked as foreground. Whilst reliable in con-
trolled indoor conditions, this simple technique has many serious drawbacks.
The most obvious and severe is that it is totally incapable of adjusting to
changes in background illumination levels. Much research has been aimed at
analysing and overcoming the difficulties inherent in background differenc-
ing. Ten canonical problems have been identified as listed in Javed et al [12],
and analysed in greater detail by Toyama et al.[24]. Javed et al. concentrate
on those problems that still pose the greatest difficulties even in modern seg-
mentation algorithms (marked with [J]). Although these problems are listed
in the context of background differencing, many of these problems are also
symptomatic of other segmentation techniques.
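In outline, the simplest form of background differencing described above, followed by connected-component grouping into silhouettes (as in the abstract's description of the tracker), could be sketched as follows. This is an illustrative Python sketch rather than the thesis's implementation; the threshold and minimum-area values are arbitrary placeholders.

```python
import numpy as np
from scipy.ndimage import label

def segment_silhouettes(frame: np.ndarray, reference: np.ndarray,
                        threshold: int = 25, min_area: int = 50) -> list:
    """Plain background differencing followed by connected components.

    Pixels whose absolute difference from the reference image exceeds
    `threshold` are marked as foreground; 8-connected foreground regions
    form silhouettes, and regions smaller than `min_area` pixels are
    discarded as noise.  Returns a list of boolean silhouette masks.
    """
    diff = np.abs(frame.astype(int) - reference.astype(int))
    foreground = diff > threshold
    # 8-connectivity so diagonally touching pixels join one silhouette.
    labelled, n = label(foreground, structure=np.ones((3, 3)))
    silhouettes = []
    for i in range(1, n + 1):
        mask = labelled == i
        if mask.sum() >= min_area:
            silhouettes.append(mask)
    return silhouettes
```

A fixed reference image like this exhibits exactly the weaknesses listed below; adaptive schemes instead update the reference over time so that slow illumination changes are absorbed into the background.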

• Slow Illumination Changes: These can be caused by a gradual changing of lighting conditions over the course of a day, or perhaps slow changes in cloud cover. This can cause the reference image pixel values to no longer accurately reflect the conditions.

• Quick Illumination Changes[J]: Also known as Light switch, this is
caused by a sudden change in illumination as caused by lights being
switched on/off or possibly by cloud activity. These first two prob-
lems are kept distinct despite their similarity because very different
techniques are required to combat them.

• Relocation of Background Objects [J]: Objects that are initially part of the background or are outside the scene may move and become part of the background elsewhere. They should not be forever recognised as foreground objects.

• Initialisation with moving objects [J]: The classic scenario for this
is that a person is walking across the scene as the system initialises.
Often, both the moving object and the space it used to occupy are
identified as foreground when only the object itself should be. Related
to bootstrapping (see below).

• Shadows [J]: Shadows cast by moving objects are often accidentally segmented as moving objects themselves.

• Distractors: Distractors refer to objects that are considered background but are not completely static. These may be waving trees (also an alternative name for this problem), the ripples on the surface of water or reflections.

• Camouflage: This problem is caused when foreground objects, or parts of them, have a similar colouration to the background. This causes them to blend into the background and not get picked up by the segmentation process.
• Bootstrapping: Bootstrapping is required when a period of inactivity
for the scene is unavailable and a clean reference image could not be
obtained. This could, for example, involve generating a reference image
over a period of time, despite the presence of foreground objects.

• Foreground Aperture: Foreground aperture occurs when a uniformly coloured object moves slowly across the scene. In the centre of the object, no movement is visible because of the lack of texture. Movement has to be inferred from border movements. This is not a major problem in background differencing, but is more relevant to techniques such as optical flow.

• Sleeping person: Foreground objects may stop moving within the scene. They should not be incorporated into the background. The solution to this problem can be at odds with that of the relocation of background objects. If objects in the earlier problem are to be integrated into the background and these objects not, criteria by which to differentiate between the two situations must be defined.

The simple background differencing algorithm as described above suffers from all of these problems except, of course, foreground aperture. Koller et al.[16] and Hunter et al.[11] use an adaptive background model, enabling the system to adapt to slowly changing lighting conditions. The background is updated frame by frame according to the following update equation, written in Kalman filter formalism as equation 2.1:

Bt+1 = Bt + αDt (2.1)

α = 0.1 if pixel was classified as background
α = 0.01 if pixel was classified as foreground

The figures given here for α are typical values only. B(t) is the background
at time t, and D(t) is the difference between the reference image and the
current image. Areas identified as foreground by the differencing step are
updated more slowly than background pixels to prevent foreground objects
from unduly altering the background. This is essentially a temporal low-
pass filter, since long-term objects exert a large effect on the reference image
whilst transient objects have little influence. Foreground pixels are allowed a
small influence to prevent misclassified pixels from remaining so indefinitely.
This concept of continual background adaptation is often referred to as background
maintenance, and is crucial to address the problem of gradual illumination
changes. In addition, Koller et al. apply a gaussian filter to the image
prior to background subtraction to reduce the influence of noise. An obvious
deficiency of this model is that it relies on a global threshold when this may
not be the best model of the underlying data. For example, in well
illuminated sections of background there will likely be a larger variation in
pixel values than in shadowy areas due to the increased contrast.
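Equation 2.1 can be sketched per pixel as follows, with α chosen according to the classification produced by the differencing step. Function and variable names are illustrative; the α values are the typical ones quoted above:

```python
import numpy as np

def update_background(background, frame, fg_mask,
                      alpha_bg=0.1, alpha_fg=0.01):
    """One step of B_{t+1} = B_t + alpha * D_t, where D_t is the signed
    difference between the current frame and the reference, and alpha
    is selected per pixel from the foreground/background mask."""
    diff = frame.astype(np.float64) - background
    alpha = np.where(fg_mask, alpha_fg, alpha_bg)
    return background + alpha * diff

background = np.full((2, 2), 100.0)
frame = np.full((2, 2), 110.0)
fg = np.array([[True, False], [False, False]])
new_bg = update_background(background, frame, fg)
```

The foreground pixel drifts towards the new value far more slowly than the background pixels, giving the temporal low-pass behaviour described above.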
Essentially, the reference image in this case is a set of mean pixel in-
tensities, which are corrected every frame (to follow the Kalman predictor-
corrector formalism). As a refinement to the model, Pfinder, developed by
Wren et al.[27], measures the covariance of pixels and sets a local thresh-
old using the Mahalanobis distance. By setting the threshold locally and
statistically rather than by hand, the threshold can be tighter and better
tailored to the conditions. The variance of a pixel’s intensity can be updated
in a similar way to its mean. Equation (2.2) shows the pixel-by-pixel update
equation for variance on a greyscale image:

σt² = (1 − ρ) × σt−1² + ρ × (Xt − µt)² (2.2)

ρ ≃ 0.05 (typically)

The threshold at a given pixel can then be defined as intensities that are
within 2.5 standard deviations of the mean. This allows a tight threshold
for areas of low noise whilst the threshold is relaxed for areas that display
high variance. This adaptability also helps to mitigate the effect of the waving trees problem, as higher variances are learned for these areas.
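A sketch of this per-pixel mean and variance maintenance in the style just described, using the update of equation 2.2 and a 2.5-standard-deviation threshold (function names are illustrative):

```python
import numpy as np

def update_stats(mean, var, frame, rho=0.05):
    """Update per-pixel running mean and variance (equation 2.2)."""
    new_mean = (1 - rho) * mean + rho * frame
    new_var = (1 - rho) * var + rho * (frame - new_mean) ** 2
    return new_mean, new_var

def foreground(mean, var, frame, k=2.5):
    """Flag pixels more than k standard deviations from the mean."""
    return np.abs(frame - mean) > k * np.sqrt(var)

# Two pixels with mean 100 and variance 4 (sigma = 2, so 2.5 sigma = 5).
mean = np.array([100.0, 100.0])
var = np.array([4.0, 4.0])
frame = np.array([110.0, 102.0])
fg = foreground(mean, var, frame)
mean, var = update_stats(mean, var, frame)
```

The first pixel (10 grey levels away) is flagged as foreground, the second (2 away) is not; noisy areas accumulate higher variance and hence a more relaxed threshold.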
Instead of maintaining mean and variance for individual pixels, W4, de-
veloped by Haritaoglu et al.[8], records minimum intensity (M), maximum
intensity (N) and maximum absolute interframe difference (D) for each pixel.
Pixels are classified as foreground only if equation 2.3 is true.

|M(x) − I(x)| > D(x) ∨ |N(x) − I(x)| > D(x) (2.3)
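Equation 2.3 translates directly into a vectorised test (array names here are illustrative):

```python
import numpy as np

def w4_foreground(m_min, n_max, d_max, frame):
    """Foreground test of equation 2.3: a pixel is foreground if its
    distance from either the learned minimum (M) or maximum (N)
    intensity exceeds the maximum interframe difference (D)."""
    return (np.abs(m_min - frame) > d_max) | (np.abs(n_max - frame) > d_max)

# Learned statistics for two pixels: range [90, 110], max difference 15.
m = np.array([90.0, 90.0])
n = np.array([110.0, 110.0])
d = np.array([15.0, 15.0])
fg = w4_foreground(m, n, d, np.array([100.0, 150.0]))
```

A value of 100 sits inside the learned envelope and stays background; 150 lies 60 levels from the minimum and is flagged as foreground.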

The minimum, maximum and interframe difference are learned during a short training phase with no foreground objects present. Once tracking
has begun, the background model is updated every few seconds. Only those
pixels identified as background are updated to prevent foreground objects
corrupting the background history. Morphological erosion followed by di-
lation is then applied to remove foreground that is produced by noise. A
connected components algorithm then groups the pixels into labelled regions
and only those regions that are above a certain size are kept. By putting
a threshold on the size of regions, the algorithm aims to eliminate patches
created by noise or changes in illumination. As with the mean and covariance
algorithm, areas with generally high variation are less likely to be misclassi-
fied as foreground.
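The cleanup pipeline just described, connected-components labelling followed by a size threshold, can be sketched in pure Python for a binary mask. The minimum size is illustrative, and discarded regions simply leave gaps in the label numbering:

```python
import numpy as np
from collections import deque

def labelled_regions(mask, min_size=3):
    """Group foreground pixels into 4-connected regions and discard
    those smaller than min_size, as is done to remove noise patches."""
    labels = np.zeros(mask.shape, dtype=int)
    next_label = 0
    for sy, sx in zip(*np.nonzero(mask)):
        if labels[sy, sx]:
            continue                       # already part of a region
        next_label += 1
        queue, pixels = deque([(sy, sx)]), []
        labels[sy, sx] = next_label
        while queue:                       # breadth-first flood fill
            y, x = queue.popleft()
            pixels.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = next_label
                    queue.append((ny, nx))
        if len(pixels) < min_size:         # too small: treat as noise
            for y, x in pixels:
                labels[y, x] = 0
    return labels

# A 2x2 blob (kept) and an isolated pixel (discarded as noise).
mask = np.zeros((5, 5), dtype=bool)
mask[1:3, 1:3] = True
mask[4, 4] = True
labels = labelled_regions(mask)
```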
The algorithms listed above all assume that the background model is uni-
modal – that is, that the intensity of a given pixel has a single mean colour value. In the mean and covariance algorithms, each pixel is modelled by a single gaussian with a mean and covariance. Stauffer and Grimson[22] present
a Mixture of Gaussians (MoG) algorithm that instead allows a multi-modal
representation of the background. This algorithm is now widely used as a
basis for more complex segmentation algorithms. As demonstrated in the
Stauffer and Grimson[22] paper, if pixel intensities are plotted on a graph
over time they often display several clusters. This is particularly true of
repetitious and structured background movement, such as the ripples on a
water’s surface or the blinking of an LCD screen. A number, k, of gaussian
distributions, each centred over a mean value of the pixel, is maintained in
order to try to ‘explain’ a current pixel value. Each gaussian is weighted ac-
cording to how frequently it best ‘explains’ the background. The gaussians
are sorted in order of weight divided by variance. In this way, the algorithm
favours those gaussians that have frequently been the best representation of
the background and which have low variance. The algorithm then identifies
which gaussians best represent the background by picking the first b distri-
butions such that their combined weights add up to less than a threshold T.
The higher T is, the more gaussians are likely to be included and therefore
the more multi-modal the background is allowed to be. Each gaussian is
now examined in order until a match with the current pixel value is found.
A match is defined as being within 2.5 times the standard deviation of the
mean. If the matched gaussian was not identified as being part of the back-
ground, the pixel is flagged as foreground. If no gaussians match, the last
(‘weakest’) gaussian distribution in the list is replaced by a new distribu-
tion with a mean equal to the current pixel value, a low weight and a high
variance. If a gaussian is found, the matched gaussian’s mean and variance
is updated in a manner similar to the mean and covariance algorithms, but weighted by the likelihood of this gaussian being the best match. The weight
of this gaussian is also increased relative to the other gaussians.
In this way, gaussians that are often matched have high weight and are
continually adapted to new lighting conditions. If the lighting conditions
suddenly change, a new distribution may be created to model this. Initially
the distribution has low weight but if this new distribution is stable the weight
will increase until it is considered part of the background. This algorithm is
therefore able to adapt to gradual illumination changes. Sudden illumination
changes will cause the entire view to be considered foreground, until sufficient
adaptation occurs. Similarly, it can adapt to the relocation of background
objects as the algorithm learns the new background values. The same process
that allows for this learning can also cause foreground objects which remain
stationary for too long to fade into the background and tracking can be lost.
When the foreground object begins to move again, a foreground ‘hole’ is left
behind where the background expects the person to be. The algorithm does
however cope well with distractors such as moving foliage where it learns that
these pixels are multi-modal and have high variance. It copes particularly
well with distractors that are truly multi-modal such as flashing construction
lights, but it should be noted that not all distractors are truly multi-modal.
A detailed statistical analysis by Gao et al.[7] showed that the MoG model
is indeed a better representation of the background than the simpler mean
and variance (single Gaussian) approach. This technique was implemented
during early testing phases, the results of which are discussed in section 3.2.1.
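The per-pixel match step of the MoG algorithm can be sketched for a single greyscale pixel as follows. This is a simplified sketch: it follows Stauffer and Grimson's convention of including in the background the first distributions whose cumulative weight reaches T, and it omits the weight, mean and variance update equations:

```python
import numpy as np

def mog_classify(gaussians, x, T=0.7, k=2.5):
    """Classify one pixel value against its mixture model.

    `gaussians` is a list of (weight, mean, variance) tuples. They are
    ranked by weight/variance; the first distributions whose cumulative
    weight reaches T form the background set. The value is foreground
    (True) if it matches, within k standard deviations, a distribution
    outside that set, or if it matches no distribution at all.
    """
    ranked = sorted(gaussians, key=lambda g: g[0] / g[2], reverse=True)
    bg_count, cum = 0, 0.0
    for w, _, _ in ranked:
        if cum >= T:
            break
        cum += w
        bg_count += 1
    for i, (w, mu, var) in enumerate(ranked):
        if abs(x - mu) <= k * np.sqrt(var):
            return i >= bg_count   # matched a non-background gaussian
    return True                    # no gaussian explains this value

# A dominant stable distribution plus two weaker, less trusted ones.
gaussians = [(0.7, 100.0, 4.0), (0.2, 180.0, 25.0), (0.1, 50.0, 100.0)]
```

In a full implementation, an unmatched value would also replace the weakest distribution with a new low-weight, high-variance gaussian, as described above.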
The Wallflower algorithm, developed by Toyama et al.[24], splits the ob-
ject segmentation stage into three levels: pixel, region and frame. At the
pixel level, a one step Wiener prediction filter predicts what background val-
ues are expected in the next frame. It uses a recent history of values to do this (typically around 50) and any pixel found to deviate significantly
from this prediction is classified as foreground. The expected squared error
value is calculated and a local threshold is set at four times the root of the
expected squared error value. The linear predictor is well suited to mod-
elling periodically varying pixel values such as LCD screens, and will learn
a higher expected squared error value for areas of high variance that may con-
tain waving trees, for example. It does not, however, explicitly model the
background as a multi-modal entity. The region level backprojects the fore-
ground’s histograms locally. Therefore, homogenously coloured foreground
objects which were not fully segmented at the pixel level are more completely
segmented on the basis of their colour. At the frame level, several background
models are maintained which have been learnt during a training phase using
a k-means clustering algorithm. The most appropriate background is then
chosen as that which produces the fewest foreground pixels. The problem
with this frame-wide approach is that the light switch problem is only dealt
with effectively if the new lighting conditions are known a priori.
None of the models mentioned thus far has explicitly dealt with shad-
ows. Whilst many of these algorithms have used RGB colour space, Pfinder, mentioned previously as the mean and covariance system, uses the YUV colourspace. In the YUV colour space, originally developed so that
colour TV signals would be backwards compatible, luminance (Y) is sepa-
rated from chroma (UV). This colourspace is useful because whilst shadows
have a tendency to darken a pixel, they have a much smaller effect on the
colour of it. Harnessing this fact, Harville[10] presents an adapted Mixture
of Gaussians algorithm that attempts to detect and correct for lighting effects
such as shadows and moved background objects. Each pixel is represented
in YUV colourspace with an additional Depth component, using a real-time stereo camera setup. The depth component is obtained using stereo corre-
spondence between the left and right images, using techniques which bear
a resemblance to those used in optical flow. The MoG algorithm is then
implemented in YUVD space, with several notable differences. Since the
chroma components become unreliable at very low luminance, the U and V
components are ignored when comparing the current observations with the
background model in low lighting conditions. Similarly, depth measurements
are sometimes flagged as invalid when depth measurements are known to
be inaccurate. This tends to occur in homogenously-coloured areas, since
depth is often measured using correspondence of texture between the two
cameras. Shadows are specifically addressed in the following ways. The sys-
tem’s model of the variance in the brightness (luminance) of pixels is not
allowed to fall below a substantial floor level. This allows pixels to vary
their brightness, whilst still ensuring that pixels whose chroma differs signif-
icantly are flagged as foreground. Also, where depth is a reliable match the
colour match threshold is increased. This allows areas in strong shadow, and
sometimes even areas of moving foliage, to remain classified as background.
To reduce the influence of foreground objects, in areas exceeding a tempo-
rally smoothed interframe difference by a set threshold, the learning rate is
greatly reduced. If there is no reliable depth measurement, the gaussians
best representing the background are selected as described by Stauffer and
Grimson[22]. If reliable depth is available, the algorithm favours gaussian
distributions with the largest depth. This reflects the implicit assumption
that the furthest objects in a scene tend to be the background. This is par-
ticularly useful in the relocation of background objects problem. If a chair
is moved, for example, the area it used to occupy will become background
more swiftly without compromising the foreground segmentation. Results from this algorithm appear to be encouraging. Repetitive background distractors are largely ignored, shadows are often identified and removed, and areas of high traffic cause little ‘damage’ to the background model thanks to the model’s preference for distant backgrounds.
KaewTraKulPong et al.[14] present in their paper an improved adaptive
background mixture model (MoG) with shadow detection. Whilst the al-
gorithm is closely related to the original MoG algorithm (with new update
equations), it introduces a computationally inexpensive shadow detection
module. Converting an entire image to YUV colourspace can be computa-
tionally expensive. Instead, for each non-background pixel a brightness dis-
tortion and a colour distortion are calculated (relative to the background),
and each distortion must be within a set threshold for the pixel to be flagged
as shadow. The underlying mathematical model of shadows is near identical
to that presented by Harville.
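The brightness/colour distortion test can be sketched for a single RGB pixel as follows. The threshold values and function name are illustrative, not those of the cited paper:

```python
import numpy as np

def shadow_test(bg_rgb, px_rgb, bright_lo=0.5, bright_hi=1.0, colour_max=10.0):
    """Decide whether one foreground pixel is a cast shadow.

    The brightness distortion is the scale factor that best maps the
    background colour vector onto the observed one; the colour
    distortion is the residual after that scaling. A shadow darkens
    the pixel (distortion below 1) while leaving the chroma largely
    intact, so both distortions must fall inside their thresholds.
    """
    bg = np.asarray(bg_rgb, dtype=float)
    px = np.asarray(px_rgb, dtype=float)
    alpha = px.dot(bg) / bg.dot(bg)                 # brightness distortion
    colour_dist = np.linalg.norm(px - alpha * bg)   # colour distortion
    return bright_lo <= alpha < bright_hi and colour_dist < colour_max
```

A uniformly darkened pixel such as 0.7 × (120, 100, 80) passes the test, whereas a pixel of genuinely different colour fails on the colour distortion.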
Javed et al.[13] use gradient cues to detect shadow regions. Firstly, the
algorithm picks out only those foreground areas that are darker than the
reference image. These pixels are then grouped into connected regions. Edges
and gradients are good illumination invariant features. The gradient vector
for the background and the current image are calculated. Whilst the gradient
intensity might change a great deal, the vector direction will not change
much. Areas whose gradient direction is well correlated with the background
are considered shadow. This detects shadows cast from foreground objects
but is able to ignore shadows falling onto foreground objects.
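A minimal sketch of this gradient-direction cue, scoring a candidate shadow region by the fraction of pixels whose gradient direction agrees with the background (the angle tolerance and function name are illustrative):

```python
import numpy as np

def gradient_direction_match(background, region, angle_tol=0.3):
    """Return the fraction of pixels whose gradient direction agrees
    between the background and the current image patch; a high
    fraction suggests a cast shadow rather than a true object."""
    gy_b, gx_b = np.gradient(background.astype(float))
    gy_r, gx_r = np.gradient(region.astype(float))
    dir_b = np.arctan2(gy_b, gx_b)
    dir_r = np.arctan2(gy_r, gx_r)
    # Wrap the angular difference into [-pi, pi] before thresholding.
    diff = np.abs(np.angle(np.exp(1j * (dir_b - dir_r))))
    return float((diff < angle_tol).mean())

# A horizontal intensity ramp as background.
bg = np.tile(np.arange(8.0), (8, 1))
```

A darkened copy of the background (0.6 × bg, as under a shadow) preserves every gradient direction and scores 1.0, whereas a patch with perpendicular structure scores 0.0.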
Most of the algorithms presented so far have treated each pixel indepen-
dently (or within a small local region) with no higher level processing, except
the Wallflower algorithm. Recent improvements in background segmentation
have focused on extending current background maintenance by using high-level modules such as those in Wallflower, and using feedback to enhance the
underlying background model. The form of high-level feedback depends on
the type of tracking being performed. Recognising this, Harville[9] presents
a general framework for high-level feedback and provides an example of its
use in a realistic scenario. Positive feedback is produced in areas where the
system is confident the foreground is correctly segmented, and negative feed-
back where it is likely to have been incorrectly segmented as foreground.
High-level modules produce pixel maps of positive and negative numbers.
These maps are then added together pixel-wise so that positive evidence can
override negative and vice versa. This map is then thresholded to produce
two binary masks, one for positive feedback and one for negative. The goal of
positive feedback is to prevent foreground objects from influencing the back-
ground. Therefore, all background pixels identified by the positive feedback
are not updated. This is especially useful for mitigating the effect of station-
ary foreground objects and the effects of high-traffic on the background. The
goal of the negative feedback is to identify background that was incorrectly
identified as foreground. Therefore, a gaussian model is used to model those
incorrectly identified pixels. At rates of 0.2 to 1Hz, this error model is merged
with the current background model. In this way, the background model is
helped to more accurately reflect the background. Negative feedback in the
form of a frame-wide illumination change detector can help the model to
adapt quickly to sudden illumination changes, with positive feedback such
as a pedestrian tracker overriding the feedback for areas in which a person is
present. This enables continuous tracking even in difficult lighting conditions.
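The combination of feedback maps into the two binary masks can be sketched as follows (threshold values and names are illustrative):

```python
import numpy as np

def feedback_masks(maps, pos_thresh=1.0, neg_thresh=-1.0):
    """Combine per-module maps of signed evidence into positive and
    negative feedback masks. Summing lets positive evidence override
    negative and vice versa before each mask is thresholded."""
    combined = np.sum(maps, axis=0)
    positive = combined >= pos_thresh   # e.g. confident person pixels
    negative = combined <= neg_thresh   # e.g. misclassified background
    return positive, negative

# A pedestrian tracker votes +2 over a person; a frame-wide illumination
# detector votes -1.5 everywhere. The tracker overrides the detector
# within the person's area.
maps = np.zeros((2, 4, 4))
maps[0, 1:3, 1:3] = 2.0
maps[1] = -1.5
pos, neg = feedback_masks(maps)
```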
Javed et al.[12] describe a pixel level, region level and frame level break-
down of segmentation similar to that introduced in the Wallflower paper,
though the implementation differs greatly. The pixel level is performed by a standard MoG together with gradient-based background differencing. High
gradient-based differences will tend to occur mostly at the edges of fore-
ground objects. Both colour and gradient information are combined at the
region level. Pixel-level foreground objects are grouped into connected com-
ponents to form regions. The region’s boundary pixels are considered. Only
regions which exhibit both high gradient difference and high gradient in the
current image at their boundaries are classified as correctly segmented and
are kept as foreground. Those identified as incorrectly segmented have their
best matching gaussians’ weight increased to prevent a recurrence of the mis-
classification. This reliance on gradient prevents shadows and other lighting
effects, which tend to have fuzzy low gradient edges, from being misclassified.
It also helps correctly classify moved background objects, as the area vacated
by the object may no longer display high gradient at its edges and will be
subsumed into the background. The frame level simply detects frame-wide
illumination changes and switches to gradient-only subtraction in such a case.
Optic flow techniques [4] rely on matching blocks of fixed size (5 x 5
windows or bigger) on two consecutive image frames. This algorithm seeks
to match a block of the first image with a block of the second image by
minimising squared difference of intensities inside the considered window.
The output of these techniques is generally a pixel map of vectors together
with a measurement of the validity of the measurement at this pixel. Hence,
the result is an overview of the motion in the image across the two frames.
The major advantage of this technique is that it is relatively unaffected by
illumination changes and provides motion information as well as a foreground
segmentation. This simplifies the temporal object matching step. Though
powerful, this technique requires large amounts of processing power that is not available in the context of this work, therefore precluding it as a viable option in the construction of a solution. However, this technique is still explored in section 3.2.1.
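The block-matching step at the heart of these techniques can be sketched for a single block by exhaustive search over a small neighbourhood (window and search sizes are illustrative):

```python
import numpy as np

def block_flow(frame0, frame1, y, x, win=2, search=3):
    """Estimate the motion of the (2*win+1)-square block centred at
    (y, x) in frame0 by minimising the sum of squared differences
    over a +/-search neighbourhood in frame1. Returns (dy, dx)."""
    block = frame0[y - win:y + win + 1, x - win:x + win + 1].astype(float)
    best, best_cost = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = frame1[y + dy - win:y + dy + win + 1,
                          x + dx - win:x + dx + win + 1].astype(float)
            cost = np.sum((cand - block) ** 2)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best

# A small textured patch shifted by one row and two columns.
frame0 = np.zeros((16, 16))
frame0[5:8, 5:8] = np.arange(1, 10, dtype=float).reshape(3, 3)
frame1 = np.roll(np.roll(frame0, 1, axis=0), 2, axis=1)
flow = block_flow(frame0, frame1, 6, 6)
```

Doing this at every pixel is what makes the technique so computationally expensive; the minimum cost also provides the validity measure mentioned above.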
In the course of this subsection, the basic concepts of background differ-
encing have been covered, together with common limitations of these systems
and a critical appraisal of recent algorithms developed to overcome these dif-
ficulties. At the pixel level, several systems were explored, such as algorithms
based on simple background differencing, wiener prediction filtering, tem-
poral derivative (max. interframe difference), mean and covariance (single
gaussian), and the now popular mixture of gaussians algorithm in its various
‘flavours’. Slow illumination changes can be dealt with using an adaptive
background model, and fast illumination changes by learning a priori mod-
els of the background or by using frame level detection to learn the new
conditions. YUV colour space, depth and gradient information can be used
to differentiate shadows from true foreground objects. It could be argued
that these algorithms may be close to the limit of what can reasonably be
achieved at the pixel level, and that higher-level information is needed to
provide substantial improvements in performance. As has been detailed, this
high level information can be used to provide feedback to lower layers and
help adapt the background model to provide a more accurate representation.
Many of these tiered level algorithms use a variant of the mixture of gaussians
algorithm to provide the pixel level information as it is generally considered
to be one of the most robust algorithms available today. The foreground
information gathered by the object segmentation stage is then passed on to
the object matching stage.

2.4 Object Matching and Tracking
The object segmentation stage, as described in the previous section, is de-
signed to identify which regions of the image correspond to foreground ob-
jects in a given frame of the image. Typically, the output is a binary image
with each pixel labelled as either foreground or background. Specular noise
is removed using techniques such as morphological opening. Regions are
then grouped together and labelled using connected components and those
regions smaller than a pre-determined size are often discarded to remove
patches created by noise and to simplify the object matching stage. Es-
sentially, the object segmentation stage provides a single frame snapshot,
containing labelled foreground regions. The object matching stage uses this
data and combines frames to provide a temporal correspondence across the
frames and therefore a trace of the progress of the object across the scene.
The tasks of the object matching stage are twofold. The first is to detect
when a new foreground object enters the scene and initialise a structure to
track the object, and to also detect when this object leaves the field of view.
The second is to compute the correspondence between the foreground regions
detected in the object segmentation stage and the objects that are currently
being tracked. Ideally, the object tracker should be able to deal with the
inadequacies of the object segmentation stage. In the best situation, fore-
ground regions, henceforth referred to as silhouettes, match one to one with
the objects being tracked. Difficulties can arise when a single object breaks
up into several silhouettes due to poor segmentation. The tracker should
be able to recognise that such is the case and act appropriately. Similarly,
two objects travelling close together may merge to form a single silhouette.
Therefore, the object tracker may need to match any number of objects to any
number of silhouettes, including zero (when objects disappear). In addition, the ideal tracker should be able to deal with occlusion in its various forms.
This might include occlusions between foreground objects, and foreground
objects being occluded by background objects. This demonstrates that the
tracking process can be extremely complex, with these situations potentially
leading to poor tracking, and therefore inaccurate object positions. These
inaccurate measurements can be modelled as a form of noise. To smooth
the extracted features, such as x and y coordinates, most trackers use some
kind of linear filtering, usually in the form of a Kalman filter[26][18]. The
Kalman filter is an optimal recursive data processing algorithm. It is opti-
mal in that it uses all available data, regardless of precision, to produce an
estimate of the desired variable such that error is minimised statistically. It
assumes the system and measurement noises are white and gaussian. Data is
presented to the filter together with an estimate of its accuracy, and the level
of contribution the new data makes is directly related to the accuracy of the
measurement. Unlike some filters, the filter operates recursively giving it the
distinct advantage that all previous data need not be stored and reprocessed
at each stage. This makes the computational requirements relatively light.
At each interval, the filter first predicts the current state using previ-
ous data and then corrects this estimate using the new data in a two-step
predictor-corrector cycle. This cycle is illustrated in figure 2.2.
At time t the predictor step projects the variables forwards in time to
predict the state and state error covariances at t+1. The corrector step then
improves this prediction using the measurements made at t+1 to give the
new object state and error covariance. The a priori prediction is corrected
by a weighted difference between the expected and measured values. The
weighting reflects the level of confidence (variance) the system has in the
predicted value and the measured value.

Figure 2.2: The recursive two-step predict-correct Kalman cycle

If the predicted value is known with relative certainty and the new measurement is not, the ‘correct’ step will rely more heavily on the ‘predict’ step than the new measurement. In
the extreme – where no new measurement is available – the system can rely
entirely on the prediction step. This can be used in tracking system where
tracking is lost due to occlusion, but the system wishes to predict where the
object may be so that tracking can resume when the object emerges from
occlusion. This form of Kalman filter is only able to model linear systems.
Though the Extended Kalman filter exists to model non-linear systems, often
only the basic generalised Kalman filter is used to model tracked objects,
since this is generally considered to be sufficient.
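The predict-correct cycle can be sketched for a constant-velocity model of a single coordinate. This is a generic textbook formulation, not the thesis's own filter; the noise covariances are illustrative:

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict-correct cycle of a linear Kalman filter.

    x, P : state estimate and its covariance
    z    : new measurement, or None when tracking is lost, in which
           case only the prediction step is applied
    F, H : state transition and measurement matrices
    Q, R : process and measurement noise covariances
    """
    # Predict: project the state and covariance forward in time.
    x = F @ x
    P = F @ P @ F.T + Q
    if z is None:
        return x, P
    # Correct: weight the innovation by the Kalman gain.
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

# Constant-velocity model for one coordinate: state = [position, velocity].
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[1.0]])
x, P = np.array([0.0, 0.0]), np.eye(2)
for z in [1.0, 2.1, 2.9, 4.0]:
    x, P = kalman_step(x, P, np.array([z]), F, H, Q, R)
```

Passing `z=None` during occlusion runs only the prediction step, exactly the behaviour described above for resuming tracking when an object re-emerges.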
In their real-time tracker, Stauffer and Grimson[22] maintain a ‘pool’ of
Kalman filters. Matching between connected components and each Kalman
filter model is achieved probabilistically with each model being matched to
the connected component it is most likely to ‘explain’. Those connected
components that are not sufficiently explained by existing models, and are
present for several frames, will have a new Kalman filter created to explain
them. If a model does not have a match, its fitness level drops by a constant
factor, instead of being removed immediately. In this case, the predictor step
of the Kalman filter dominates and simply projects the state vector forward.

If a match is found within a few frames, tracking can resume. Models whose fitness falls below a given level are removed. Although the system is able
to deal robustly with temporary loss of tracking, it has no way of dealing
with situations such as two foreground objects merging into a single silhouette, or
a single object being poorly segmented as several silhouettes. Therefore the
underlying assumption is that a single silhouette will match a single object.
Another weakness is that a large number of objects in the scene could also
cause heavy processor loading as the model matching algorithm is prone to
combinatorial explosion.
In an attempt to address these issues, Owens[20] designed a greedy algo-
rithm to match silhouettes to objects, with silhouettes’ splits and merges in
mind. At the core of the algorithm is a cost function. The cost of match-
ing a silhouette to an object depends on how closely related their positions,
bounding box geometry, area, and intensity histograms are. This cost func-
tion is replicated here for development and comparative purposes, and is
described in section 3.3.2. Initially, the objects are matched to silhouettes
by lowest cost. In situations where two objects match a single silhouette, a
macro object (an artificial merged version of the two objects) is created to
be submitted to the cost function to test whether two objects have merged.
If this is the case the silhouette is split into two using a line perpendicular
to that of the linear regression line, yielding satisfactory results with little
computational overhead. Later in the algorithm, unmatched silhouettes are
added to already matched objects if they reduce the cost. Using the match,
the state vectors (area, height, width, histogram) are updated ready for the
next frame. This algorithm is therefore able to deal with poor silhouette seg-
mentation resulting in fragmentation, and two objects merging into a single
silhouette. The partitioning algorithm assumes that there is very little overlap between the two objects and is therefore unable to deal with situations where large amounts of occlusion occur. The design of this system inspired many of the basic features of the final design of the tracking system described in this thesis.
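The greedy lowest-cost matching can be sketched as follows. The cost function here is a simplified stand-in (distance plus relative area change) for the full cost function of section 3.3.2, and the weighting and cut-off values are illustrative:

```python
import numpy as np

def match_greedy(objects, silhouettes, max_cost=50.0):
    """Greedily pair objects with silhouettes by lowest cost.

    Each object/silhouette is a dict with 'pos' (x, y) and 'area'.
    Returns {object_index: silhouette_index}; objects whose cheapest
    remaining pairing exceeds max_cost are left unmatched.
    """
    def cost(o, s):
        dist = np.hypot(o['pos'][0] - s['pos'][0], o['pos'][1] - s['pos'][1])
        area = abs(o['area'] - s['area']) / max(o['area'], 1)
        return dist + 10.0 * area

    pairs = sorted((cost(o, s), i, j)
                   for i, o in enumerate(objects)
                   for j, s in enumerate(silhouettes))
    matches, used_obj, used_sil = {}, set(), set()
    for c, i, j in pairs:
        if c > max_cost:
            break                      # remaining pairs are too costly
        if i not in used_obj and j not in used_sil:
            matches[i] = j
            used_obj.add(i)
            used_sil.add(j)
    return matches

objects = [{'pos': (10, 10), 'area': 100}, {'pos': (40, 40), 'area': 80}]
silhouettes = [{'pos': (42, 41), 'area': 78}, {'pos': (11, 12), 'area': 105}]
matches = match_greedy(objects, silhouettes)
```

The full algorithm extends this with macro objects for merge tests and a later pass that attaches unmatched silhouettes to objects when doing so reduces the cost.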
The system developed by Koller et al.[16] tracks cars on a motorway
scene by fitting closed cubic splines to the extracted silhouettes. The result
is a smooth contour around each car, with 12 control points which are used
as the state vector. A Kalman filter is used to smooth the control points
as they are projected forwards in time. The algorithm assumes an affine
motion model. In the case of heavy occlusion, an object’s apparent centroid is
shifted producing imprecisions. This is solved using an explicit depth ordered
detection step. Vehicles closer to the bottom of the screen are assumed to
be closer to the camera and can therefore be ordered by depth. Occlusions
reasoning is performed using this depth information together with the cubic
spline contours, with apparently good results.
Attempts have also been made to track pedestrians using deformable
models. Baumberg and Hogg[2] fit an adaptive eigenshape model to the
silhouettes. Prior to tracking, a set of training shapes is used to obtain
a model. This model is then used to fit the spline around the silhouette.
Unfortunately this method can only track a single pedestrian and makes
no attempt to deal with occlusion or the splitting of silhouettes. It does,
however, demonstrate that robustly fitting a spline to a pedestrian silhouette
is feasible.
W4, developed by Haritaoglu et al.[8], uses a similar concept of shape
correlation in its tracking algorithm. In order to find candidate silhouettes,
the bounding box of an object is projected forward in time. Those silhouettes
which overlap sufficiently with the bounding box are considered potential matches. The centroid of the object is measured using the median as it is
more robust to the large motion of extremities. The centroid is also projected
forward and its estimate is used to further reduce the number of candidate
silhouettes. After assessing which silhouettes are potential candidates, a
binary edge correlation between the previous and current silhouette is used
to find the exact match. Typically, the correlation is dominated by the head
and torso parts of the body, which change little in appearance from frame to
frame. Silhouette splitting is dealt with by simply tracking the two regions
independently for several frames. Only if they can be tracked reliably will
they then be considered to be distinct objects. Heavy occlusion is not dealt
with explicitly, but objects that merge are simply tracked as a single object.
To determine the correspondence of people tracked before the merge and the
people emerging once they split again, W4 constructs appearance models
whilst the pedestrians are tracked as single entities. These are then used to
match the objects when they reappear. A ‘cardboard’ model is then used to
find the head, torso, hands and feet of the pedestrian. The bounding box is
divided into search areas for each body part, and these areas are searched.
For example, hands are found by finding extreme regions connected to the torso.
In this section, several different methods have been examined, each with
their specific strengths and weaknesses. All methods attempt to capture the
invariant essence of each object from frame to frame, whether it be its po-
sition, area, histogram or silhouette shape, in order to match a silhouette
from the previous frame to the current one. Clearly a truly invariant metric
is impossible to capture for a variety of reasons. The most significant is that
most tracking algorithms implicitly assume that the scene is a 2D surface,
with no concept of depth or the fact that objects may rotate about their axis

and change pose. Other difficulties such as camera noise, poor foreground
segmentation and occlusions conspire to make the robust tracking of pedes-
trians and cars extremely challenging. The following section will provide a
detailed description of the approaches taken to overcome these challenges,
followed in chapter 4 by an analysis of the results obtained from a detailed
performance evaluation.

Chapter 3

Tracking Algorithm

This chapter introduces the tracking algorithm, whose aim is to robustly

track objects through a greyscale scene (640x480) in real-time and at a con-
stant frame rate (4 Hz), and also to classify these objects as either cars or
pedestrians. A constant frame rate is required as this system is designed with
a view to its being used in conjunction with novelty detection systems. Fur-
thermore, a constant frame rate might be considered a corollary requirement
if the system is to be considered real-time in the purist sense.
The following section provides a high-level breakdown of the tracking pro-
cess into its three main constituent parts, namely: background maintenance,
object matching (the bulk of the algorithm), and object classification. The
rest of this chapter follows this breakdown structure by providing a section
dedicated to each one of these parts in that order. Chapter 4 provides an
overview of the statistical techniques used to assess the quality of the algo-
rithm, together with a statistical analysis of the performance of the algorithm
in realistic scenarios.

3.1 Overview
The diagram below shows in a simplified graphical form the tracking algo-
rithm pipeline. The boxes in bold show the three stages of the tracking
algorithm, from low-level image segmentation to relatively high-level object
tracking and classification.

Figure 3.1: An overview of the tracking algorithm

A frame is captured at 640x480 resolution in 8-bit greyscale and is sent to

the background differencing module. The job of the background differencing
module is to identify which pixels in the image belong to foreground and
which belong to background, and to group these pixel areas into connected
regions and send the relevant information to the object tracking module. The
following module, the object tracking module, is by far the most complex.
The background differencing module only provides a single frame snapshot
of where potential foreground objects might be. The object tracker’s task
is to match the foreground objects found in this frame with the foreground
objects found in the previous frame recursively, thus enabling the system to
follow the paths of individual objects across the scene over time. Finally, the

object classification module simply uses the basic features of an object (such
as height and area) to determine whether it is a car or a pedestrian. These
results are displayed onscreen in the scene description module. Worthy of
note is the dotted line joining the output of the object classification module
to the input of the background differencing. This represents the high-level
feedback that occurs when a car is detected as being parked, as described
in section 3.2.3. In this situation, the tracker incorporates the car into the
background and the car is no longer tracked. This feature is reminiscent of
the high level feedback proposed by Harville[10].

3.2 Background Differencing

This section describes work carried out in the development of the background
differencing module. The first subsection provides a comparative evaluation
of some of the techniques that were researched and tested during the develop-
ment of the final design. The second subsection describes and discusses the
final design in detail, followed by the final subsection detailing the high-level
feedback to this module.

3.2.1 Research and Testing

As was made clear in section 2.3 there are many techniques available to
provide object segmentation, each with their particular benefits and pitfalls.
Early in the development phase many different algorithms were tested on the
sample sequences of video. The first to be developed was a basic background
differencing algorithm, very similar to that used by Owens[20], using a global
threshold. This initial work serves as a benchmark against which the alternative
algorithms can be compared. It has been noted that due to the

poor quality of the video – typical of many analog outdoor CCTV systems
– the foreground objects are sometimes poorly segmented. The system of-
ten suffers from fragmentation, in which a single object becomes fragmented
into several foreground regions or silhouettes. Equally problematic are reflec-
tions in windows and the car bodywork, together with shadows. It is these
shortcomings that led to the search for a more robust alternative.
As will become clear further into this document, this body of work focuses
on developing algorithms by gathering statistics to help guide the research
process and to help substantiate one algorithm’s suitability over another. In
the case of the development of the background differencing algorithm this
was found to be close to impossible. A truly accurate statistical analysis
requires the development of a reference standard in which foreground regions
for each frame have been hand marked pixel by pixel. The output from a
given algorithm could then be compared with the reference standard to yield
quantitative measures for false positives and false negatives. Unfortunately,
for a sequence long enough to be of statistical significance the development
of a reference standard would be too time consuming in the context of this
work. Instead, the author relied on previous comparative work such as that
by Gao[7] and Toyama et al.[24] to help guide the research as well as simple
comparative evaluations of the algorithm onscreen.
Instead of relying purely on traditional background differencing, optical
flow algorithms are also evaluated, using them to produce a map of the im-
age motion within the scene. Whilst optical flow systems vary in technique,
they all try to solve the same essential problem which can be characterised
as follows: given two functions F (x) and G(x) which are the respective pixel
values at position x in the two images – for example a 5x5 window – the aim
is to find the displacement vector h such that some measure of the differ-

ence between F (x + h) and G(x) is minimised. In doing so, the algorithm
is matching two regions that are the same in both images and therefore the
motion between the two images is revealed. The output is usually an array
of the vectors representing the displacement at each pixel. One of the most
commonly used criteria by which two image windows can be compared is
normalised block correlation, which is simply the sum of the squared differ-
ences between pixel values – though other comparisons exist. Techniques for
searching through the values of h also differ. The simplest is to exhaustively
search all values of h to find the minimum difference window. However this
is extremely processor intensive, so heuristic search algorithms such as hill
climbing are often used. However, hill climbing itself is prone to fall into
local minima and to never find the global minimum.
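As a concrete illustration, the exhaustive search described above can be sketched in a few lines of Python. The window and search sizes here are illustrative defaults, not values used by the system, and the function name is an assumption for this sketch:

```python
import numpy as np

def block_match(F, G, x, y, win=2, search=5):
    """Find the displacement h = (dx, dy) minimising the sum of squared
    differences between the window F(x + h) and the window G(x),
    searched exhaustively over the range [-search, search]."""
    patch_g = G[y - win:y + win + 1, x - win:x + win + 1].astype(float)
    best, best_h = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy - win < 0 or xx - win < 0:
                continue  # window would wrap around the image edge
            patch_f = F[yy - win:yy + win + 1, xx - win:xx + win + 1].astype(float)
            if patch_f.shape != patch_g.shape:
                continue  # window falls off the bottom/right edge
            ssd = np.sum((patch_f - patch_g) ** 2)
            if best is None or ssd < best:
                best, best_h = ssd, (dx, dy)
    return best_h
```

The nested loop over every candidate h is precisely why the exhaustive approach is so processor intensive, and why heuristic searches such as hill climbing are used instead.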
There exists a dizzying array of optical flow techniques, but one of the
most popular optical flow algorithms is the Lucas Kanade algorithm[17]. Its
popularity is due to the fact that to many it strikes a good balance between
speed and accuracy[1]. As with many optical flow algorithms it seeks to
minimise the squared error between blocks of pixels. However, instead of ex-
haustively searching a given neighbourhood to find a good match it employs
a search method based on the image spatial gradient, thereby limiting the
search overheads. With today’s rapidly accelerating hardware, it was consid-
ered as a potential real-time object segmentation method. The advantages
over standard object segmentation techniques are substantial. Firstly, optical
flow does not suffer as greatly from illumination changes (compared with
background differencing), since the algorithm operates on a frame by frame
basis rather than maintaining a reference image. That means that only a
large change in lighting over the space of a single frame could disrupt track-
ing. However, since the tracker under development is intended for outdoor

use, this should not be a problem. Secondly, since there is no concept of a
reference image there is no need to initialise the algorithm. Tracking can
begin with foreground objects within the scene. The final and perhaps most
powerful advantage is that it provides not only the detection of movement
but also the vector of that movement – thus potentially greatly simplifying
the object tracking stage. Whilst the advantages may seem overwhelming,
the disadvantages are also imposing. Apart from the obvious computational
costs, optical flow also suffers from the aperture problem. Zones of
homogeneous colour (or greyscale brightness values) cannot be reliably tracked,
since there is little or no textural information to be matched. The edges of the
homogeneous zone are tracked reliably whilst the centre appears not to be moving.
To test the optical flow algorithms on real data, a test application was
written with the help of the Intel-founded open source package OpenCV. The
standard block-matching technique and the Lucas Kanade have been eval-
uated for both speed and accuracy. The Lucas Kanade algorithm produces
very disappointing results, even on artificially generated images. The results
may be explained by the fact that the search approximation algorithm used
by Lucas-Kanade relies on the displacement vector being very small. The
larger the displacement, the less accurate the approximation. Reliable track-
ing may require a very high frame rate and/or slow moving objects. Shown
in figure 3.2 are results comparing the Lucas Kanade (LK) with the block
matching method. The first two images are two successive frames, taken
roughly 0.25 seconds apart. The following two are the results from Lucas
Kanade and Block Matching respectively.
Prior to the optical flow calculations, both input images are convolved
with a 3x3 gaussian kernel to smooth out noise. The Lucas Kanade window
is set to 11x11 – this appears to yield the best results without excessive

Figure 3.2: Example results from a Lucas Kanade Optic Flow Algorithm – 2
successive frames, and the results from Lucas Kanade and Block Matching respectively

smoothing. The block matching window is 5x5, with a shifting window of

1x1 and a maximum range of 11 pixels.
In this sequence, a pedestrian is walking from the top of the scene to the
bottom so one would expect to see optical flow that is static on most of the
scene with the pedestrian outlined as the area moving ‘south’. Clearly, the
accuracy of block matching is far superior to that of the Lucas Kanade (L-K)
algorithm in this case. Although L-K has fairly accurately picked out areas
of motion, it has failed to accurately identify the vector of that motion. The
block matching algorithm, although the above example is relatively accurate,
was also found to fail in many frames. This was not the greatest problem
however, as this test had to be performed offline for the simple reason that
the maximum frame rate – on a 3.0 GHz machine – was a mere 3 Hz. This,
together with the inaccuracies, precluded its being a serious alternative to
standard foreground segmentation.
To address these issues, an iterative pyramidal model of the Lucas Kanade
algorithm can be implemented that is designed to track only a limited set
of points through a coarse-fine refinement algorithm. The term Pyramidal

refers to the ‘pyramid’ of images that is used for the coarse-fine refinement.
At the bottom of the pyramid lies the input image in full resolution. Above
it (conceptually), lies the same image downsampled to half the resolution.
Above it, once again at half the resolution of the layer beneath and so on.
The algorithm begins at the top of the pyramid and computes the optical
flow at the relatively low resolution. A displacement of a single pixel at this
top layer corresponds to a greater motion on the layer beneath it, which in
turns corresponds to an even greater motion to the layer beneath this second
layer and so on. When the algorithm moves down a layer, the motion of
the area above is projected down onto the surface below so it only has to
calculate the residual motion. In the standard Lucas Kanade tracker only
motion over a very small distance can be accurately tracked. Thanks to
the iterative pyramidal interpretation, large motions can now feasibly be
tracked. The number of pyramids depends on requirements and available
processing power, but the figure is typically around three. To reduce the
computational overheads, only a sparse set of points is tracked. This seems
a sensible approach when one considers that points with strong gradients
and high texture are the most easily tracked. If tracking these high-gradient
points is favoured, the results are likely to be far more reliable whilst reducing
processing overheads. Results of this approach are shown in figure 3.3.
These two images are the same two frames as the previous comparison,
in full size. In the first image, good points to track are chosen as those
having a high eigenvalue. The second image shows where these points are
now projected to by the pyramidal LK algorithm. This process is repeated
frame by frame so that a point can be tracked across the screen. It should
be noted, however, that this is not strictly an object segmentation algorithm,
since only a sparse set of points has been chosen. For this algorithm

Figure 3.3: Example results from the iterative pyramidal Lucas Kanade ap-
proach – points chosen in the first image are tracked to their equivalent points
in the second image

to be of any use, foreground regions have to first be identified by a standard

foreground segmentation technique. Its more likely use is as an additional
tool in the objects tracking stage, where it may be useful to be able to match
areas of the scene from one frame to the next. It is presented here because it
is a natural extension to the standard optical flow algorithms and of potential
use as an aid to more conventional tracking.
Showing a two frame snapshot of this algorithm does not convey the true
nature of the algorithm. Though surprisingly fast (over 4 Hz on 640x480
images) and reliable, points have a tendency to drift and may wander entirely
off the original object. This can be countered somewhat by using the

estimated accuracy measure that each point produces at the end of each
frame. If the accuracy drops below a threshold, it can be safely discarded.
It also has a tendency to follow the edges of shadows, and reflections in
the bodywork of cars and windows. It was considered as an aid to the
object tracking section, but a fairly substantial computational cost (on top
of a standard object segmentation method) together with unreliability when
tracking objects that are further from the camera make it a secondary option.
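The pyramid construction and the doubling of displacement estimates between levels can be illustrated with a minimal NumPy sketch. The 2x2 averaging downsampler and the three-level default are illustrative assumptions, not details taken from the implementation:

```python
import numpy as np

def downsample(img):
    """Halve resolution by averaging 2x2 blocks (one pyramid level up)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    im = img[:h, :w].astype(float)
    return (im[0::2, 0::2] + im[1::2, 0::2] +
            im[0::2, 1::2] + im[1::2, 1::2]) / 4.0

def build_pyramid(img, levels=3):
    """Full-resolution image at the bottom, each level above at half size."""
    pyr = [img.astype(float)]
    for _ in range(levels - 1):
        pyr.append(downsample(pyr[-1]))
    return pyr  # pyr[0] = finest, pyr[-1] = coarsest

def refine(coarse_h):
    """A 1-pixel shift found at one level corresponds to a 2-pixel shift on
    the level below, so the estimate is doubled at each descent and only
    the residual motion is then searched for."""
    return (coarse_h[0] * 2, coarse_h[1] * 2)
```

In a full coarse-to-fine tracker, the search at each finer level starts from `refine()` of the estimate above it, which is what allows large motions to be recovered from a small per-level search.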
A natural simplification of the optical flow algorithms is normalised block
correlation, in which blocks from the reference image are compared with the
current image blocks (once normalised for lighting changes). This system has
been tested to see if it could guard against lighting changes and shadows.
In theory, areas in shadows may change in intensity but their texture will
remain mostly unaffected. Whilst the algorithm is effective at blocking out
large areas of shadow, it still picks up the edges of the shadows where light
became dark. It also suffers from people’s clothing having little or no texture
in the centre, just as the background tarmac has no texture, and therefore
people are often very poorly segmented. This technique could work well
against highly textured backgrounds such as a paved walkway, but is clearly
unsuitable in this situation. Efforts to combine the lighting difference and
texture difference also fail to provide significant improvements over lighting
difference alone. Another more predictable downside to this technique is
that the resolution of the image is downgraded, due to its being dealt with
in blocks.
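One simple way to normalise a pair of blocks for lighting, as a sketch of the idea above, is to remove each block's mean brightness before taking the sum of squared differences, so that a uniform change such as a shadow scores near zero. The function name is illustrative:

```python
import numpy as np

def block_texture_difference(ref_block, cur_block):
    """Compare two blocks after removing each block's mean brightness,
    so a uniform lighting change (e.g. a shadow) scores near zero while
    a genuine change in texture still scores highly."""
    r = ref_block.astype(float)
    c = cur_block.astype(float)
    r -= r.mean()
    c -= c.mean()
    return float(np.sum((r - c) ** 2))
```

Note that, exactly as observed in the text, this measure is blind inside untextured regions (plain clothing, tarmac), since a zero-mean block of uniform brightness is indistinguishable from any other.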
Several other techniques were compared, including the use of maximum
and minimum pixel values together with the maximum interframe difference,
as is presented in W4[8]. Though good results are obtained over short scenes,
no reliable method for updating the pixel’s maximum and minimum, and

maximum interframe difference over long periods was found. The two most
reliable techniques for foreground segmentation on the test data were found to
be the Mixture of Gaussians (MoG) as proposed by Stauffer and Grimson[22]
and the basic background differencing algorithm that is described in the final
design. A brief overview of this algorithm was covered in section 2.3.2 and
further details can be found in the Stauffer and Grimson[22] paper. The
implementation used here differs from the original in several minor ways. The
major difference is that it uses a constant update coefficient, rather than one
based on the likelihood that a pixel was correctly matched to the gaussian.
As well as apparently improving results, this also creates smaller processing
overheads. A minor difference is that a gaussian is considered a match to the
current pixel if it is within 4 standard deviations of the mean, not 2.5 as
suggested by Stauffer et al. The threshold, t, which sets the minimum portion
of the data the gaussians must ‘explain’ to be considered background, is set
relatively high at 0.8. The rate of update, α, is set to 0.005. The number
of gaussians in the mixture is fixed at 3. Having tested various settings and
constants, these seem to yield the best results. The background differencing
algorithm it was compared with uses a simple globally thresholded pixel
by pixel difference between the current image and a reference image. The
reference image is then updated frame by frame using a mask produced by
the difference image. Full details of this algorithm are provided later in
section 3.2.2 as it was eventually chosen in the final design. The threshold for
the background differencing is set at 12 (on a scale of 256 possible greyscale
values). The output of both algorithms is subjected to morphological opening
to reduce specular noise.
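A highly simplified, single-pixel Python sketch of the MoG variant described above is given below (constant update coefficient, a match within 4 standard deviations, t = 0.8, K = 3). The weight and variance assigned to a replacement gaussian are illustrative assumptions, not values from the implementation:

```python
import numpy as np

K, ALPHA, T_BG, MATCH_SIGMA = 3, 0.005, 0.8, 4.0

def update_pixel(value, weights, means, variances):
    """One update step of a single pixel's K-gaussian mixture.
    Returns True if the pixel is classified as background."""
    matched = None
    for k in range(K):
        if abs(value - means[k]) < MATCH_SIGMA * np.sqrt(variances[k]):
            matched = k
            break
    if matched is None:
        # No gaussian matches: replace the lowest-weight gaussian with one
        # centred on the pixel (initial weight/variance are illustrative).
        k = int(np.argmin(weights))
        weights[k], means[k], variances[k] = 0.05, float(value), 30.0 ** 2
        matched = k
    else:
        d = value - means[matched]
        means[matched] += ALPHA * d
        variances[matched] += ALPHA * (d * d - variances[matched])
    # Constant update coefficient for the weights, then renormalise.
    for k in range(K):
        weights[k] = (1 - ALPHA) * weights[k] + (ALPHA if k == matched else 0.0)
    weights /= weights.sum()
    # Gaussians are ranked by weight; the first few that together 'explain'
    # at least T_BG of the data are taken to be the background model.
    order = np.argsort(weights)[::-1]
    bg, total = set(), 0.0
    for k in order:
        bg.add(k)
        total += weights[k]
        if total > T_BG:
            break
    return matched in bg
```

A pixel whose value suddenly jumps (e.g. a passing object) creates a fresh, low-weight gaussian and is reported as foreground until that gaussian has gained enough weight, which is the slow-absorption behaviour discussed in the comparison above.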
To ensure a fair comparison, the various parameters used in each algo-
rithm have been tweaked until each produced the best results. The pa-

rameters are balanced so that they produced roughly the same amount of
foreground. Segmentation results from the two algorithms are surprisingly
similar, to the point of being nearly indistinguishable. Pictorial comparisons
are provided in figure 3.4.

Figure 3.4: Two comparative results of pedestrian segmentation. On the left is

the original image, the middle is background differencing and the right image
is the MoG result

The upper segmentation comparison is typical of the vast majority of

segmentation attempts. The results from both algorithms are nearly indis-
tinguishable, and one algorithm cannot clearly be declared to be superior to
the other in terms of the quality of segmentation of moving objects. How-
ever, there are occasional differences in behaviour as highlighted in the second
(lower) example, where the MoG segmentation is clearly inferior. Prior to
this frame, a car had slowly entered a parking spot to the right of this image,
crossing the position where the pedestrian is currently located. As it did so
the MoG algorithm created a new gaussian mixture for all the foreground pix-
els. These slowly gained in weight over the frames that the car was present at

that position, as per the algorithm’s normal adaptive behaviour. When the car
moved out of the way, the pixels briefly regained their normal background
gaussian. The pedestrian here happens to be of a very similar brightness
to the car that was previously in the foreground. Therefore, the algorithm
switched back to the gaussians created to model the car’s brightness, already
higher in weight, and once again began increasing their weights. Due to this,
the gaussians quickly became the preferred model of the background and
parts of the pedestrian became subsumed into the background. This situa-
tion is relatively unusual, but becomes considerably more noticeable as traffic
increases. One potential problem, that of relatively stationary foreground ob-
jects disappearing into the background, proved not to be a problem on the
four hour test sequence. As cars park, they disappear into the background
relatively quickly, but as pedestrians are never stationary for the same length
of time they are never incorporated into the background. If this ever became
a problem high-level feedback from the object tracking module could be used
to prevent this from happening. Generally, both algorithms perform equally
well at normal segmentation.
The most notable differences in performance appear only on some of the
more taxing sequences, where frequent light changes, camera judder and wav-
ing trees cause false silhouettes to appear. As expected, the MoG algorithm
is generally far better suited to dealing with these situations. The most dra-
matic differences are apparent during camera judder. In figure 3.5, heavy
winds have caused the camera to shake temporarily. Ideally, only the car
should be segmented as foreground.
The MoG segmentation is on the right side and the background differenc-
ing on the left. On the difference image the judder has caused high gradient
areas to be incorrectly segmented as foreground. This is simply because it

Figure 3.5: Comparison of performance of Background Differencing (left) vs
MoG (right) with camera judder

is at these points where a shift in the position of pixels will cause the highest
contrast with the reference image. This phenomenon regularly occurs, lasting
only 1 or 2 frames, at times of very high wind. At these points of high gra-
dient, the MoG algorithm has learned that the pixels have a higher standard
deviation and is therefore less likely to segment them as foreground. Simi-
larly, waving trees in the background are generally ignored. Though at first
this may seem to be enormously beneficial, these problems occur relatively
rarely. When they do, the object tracker may instantiate new objects for a
single frame but tracking will almost immediately be lost leading to few or no
ill effects. Also, silhouettes created in these situations tend to be too far from
the object being tracked to cause any silhouette misallocations in the object
tracker. Far more common is confusion caused by reflections as a pedestrian
moves close to a car. This type of poor segmentation, unfortunately, cannot

be addressed by any of the low-level segmentation algorithms covered here.
A similar problem exists for shadows, which are indistinguishable from true
foreground objects in both algorithms (since we are only utilising greyscale).
In summary, though several techniques for object segmentation have been
examined, none of them offers any substantial improvements over background
differencing. Though MoG does provide some improvements, particularly in
the reduction of waving trees and camera judder, it does not improve the
segmentation of foreground objects with regard to the most serious problems
such as fragmentation. Given this, the large computational cost incurred by
the MoG algorithm cannot be justified. The test algorithm – unoptimised
– runs at around 3 Hz on a 3 GHz machine. Though it may be possible
to optimise it to run at the minimum 4Hz, this would leave precious little
computational time for the object tracking section. For this reason, the final
design is based on basic background differencing as described in the next
subsection. In the presence of specialised – or significantly more powerful –
hardware, the MoG algorithm might be preferred.

3.2.2 Final Design

An overview of the final background differencing algorithm is shown in figure
3.6. Boxes in bold represent those processes which are part of the background
differencing stage. Dotted lines represent feedback. A central concept of this
system is that of the reference image. The reference image, R, represents the
background empty of foreground objects. The current image I is compared
with the reference image R pixel by pixel to produce a difference image
D. Only if the two pixels under comparison differ by more than the global
threshold T , is the pixel then classified as foreground. At time t, the
difference image is computed as
Figure 3.6: An overview of the background differencing algorithm


Di (t) = 1 if |Ii (t) − Ri (t)| > T , and Di (t) = 0 otherwise        (3.1)

where i serves as an index to all the pixels in the image.

Optimal performance is achieved when T is set to a value of 12 (of a
possible 256 greyscale values). The result is a binary image, representing
a pixel by pixel dichotomy between foreground and background. Noise is
removed using a process of morphological opening, which involves an erosion
followed by a dilation. Both erosion and dilation are performed using a 3x3
structuring element. Erosion removes only small groups of pixels, whilst
leaving the larger ‘blobs’ mostly untouched. This removes specular noise.
The disadvantage is that areas that are connected only by narrow isthmuses
can become disconnected. To mitigate this effect, dilation is then applied
to reconnect these regions. In this implementation, the dilation is applied

twice to amplify this effect as a large number of silhouettes can cause a large
increase in the computational task performed by the object tracking stage.
A stage by stage pictorial example is shown in figure 3.7.

Figure 3.7: Background differencing followed by morphological noise reduction
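The differencing and morphological opening stages can be sketched as follows, using the 3x3 structuring element and the doubled dilation described above. The NumPy implementation is a minimal illustration, not the system's optimised code:

```python
import numpy as np

T = 12  # global greyscale threshold (of 256 possible values)

def difference(I, R):
    """Equation 3.1: binary foreground mask from current and reference images."""
    return (np.abs(I.astype(int) - R.astype(int)) > T).astype(np.uint8)

def erode(D):
    """3x3 erosion: a pixel survives only if its whole neighbourhood is set."""
    out = np.zeros_like(D)
    out[1:-1, 1:-1] = 1
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out[1:-1, 1:-1] &= D[1 + dy:D.shape[0] - 1 + dy,
                                 1 + dx:D.shape[1] - 1 + dx]
    return out

def dilate(D):
    """3x3 dilation: a pixel is set if any neighbour is set."""
    out = np.zeros_like(D)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out[1:-1, 1:-1] |= D[1 + dy:D.shape[0] - 1 + dy,
                                 1 + dx:D.shape[1] - 1 + dx]
    return out

def segment(I, R):
    """Difference, then opening with the dilation applied twice (as in the text)."""
    return dilate(dilate(erode(difference(I, R))))
```

The erosion removes isolated specks entirely, while the doubled dilation regrows (and slightly over-grows) the surviving regions, reconnecting silhouettes split by narrow isthmuses.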


The reference image is updated every frame to enable it to adapt to the

changing lighting conditions. The difference image D determines how the
reference image is updated. Areas determined to be foreground by equation
3.1 are updated more slowly than background areas. At time t, the reference
image is updated as follows:

Ri (t) = αi Ii (t) + (1 − αi )Ri (t − 1)        (1 ≤ i ≤ n)

where i again indexes the n image pixels. The update value αi is chosen
per pixel as follows:


 0.1 if Di (t) = 0
αi =
 0.001 if D (t) = 1

Thus, foreground regions are updated at a much reduced rate to prevent

them from unduly influencing the background. The difference image is ef-
fectively used as a mask. Foreground areas are still allowed to exert a small
influence to allow misclassified pixels to slowly return to their background
values.
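The masked update rule might be implemented along these lines, as a direct transcription of the equations above:

```python
import numpy as np

ALPHA_BG, ALPHA_FG = 0.1, 0.001

def update_reference(R, I, D):
    """Per-pixel blend of the current frame I into the reference image R,
    with pixels flagged foreground in the mask D updated far more slowly."""
    alpha = np.where(D == 1, ALPHA_FG, ALPHA_BG)
    return alpha * I + (1.0 - alpha) * R
```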
As has been discussed previously, moving background objects such as
waving trees can cause segmentation errors. In these situations a mask could
be placed over the offending area to prevent its misclassification. Areas with
trees, flags and so on tend to be in areas where little or no foreground traffic
ever appears. Throughout testing, however, this was found to be unnecessary.
Occasionally, large changes in brightness can cause large sections of the
background to be misclassified as foreground. Foreground, almost by defi-
nition, tends to account for less than 50% of the image. This assumption
can be exploited to detect these differencing failures, such that if more than
50% of the image is segmented as foreground the tracker is reinitialised by
replacing the reference image with the current image. If foreground objects
are present this will cause the object segmentation to be inaccurate until the
system has time to adapt. Thankfully, this event is relatively rare.
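This failure check reduces to a few lines; the function name is illustrative:

```python
import numpy as np

def maybe_reinitialise(R, I, D):
    """If more than half the image is flagged foreground, assume the
    differencing has failed and restart with the current frame as the
    new reference image; otherwise keep the existing reference."""
    if D.mean() > 0.5:
        return I.copy()
    return R
```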
The foreground pixels of D are then processed using a fast connected
components algorithm to find groups of fully connected regions in the image,
or silhouettes. Connected components algorithms differ in their definition of
connected. 4-connected algorithms (each pixel has 4 potential neighbours)
treat regions joined only by diagonally adjacent pixels as separate, whilst
8-connected algorithms treat them as connected. This implementation uses
an 8-connected definition. At this stage, the system has a group of silhouettes

each representing a foreground region.
Prior to the object tracking stage, a number of features are extracted from
each silhouette. The features are: the area a, height h, width w, centroid
(xc , yc ), and a 16-bin histogram, g, of the brightnesses of the pixels. Area
is simply the number of pixels in the silhouettes, and height and width are
those of the Minimum Bounding Rectangle (MBR), the smallest rectangle
that fully encloses the silhouette. The centroid is calculated as the mean
pixel ‘x’ value and the mean ‘y’ value as follows (for x):

xc = (1/a) Σ(x,y)∈S x        (3.2)

where a is the area of the silhouette and S is the silhouette under consideration.
The features of a silhouette are collectively known as the feature vector,
defined as f = [xc , yc , a, h, w, g]. This collection of feature vectors is
then passed on to the object tracking stage. When an object first appears
on the scene, it is initialised with the feature vector of the silhouette which
prompted its creation. The term object refers to the physical object that is
being tracked from frame to frame (and the internal computational structure
that reflects it), whilst silhouette is merely the foreground region which has
been detected at the current frame. Subsequent tracking is based on the
concept that a particular object’s features will remain relatively similar be-
tween two frames. When a new frame is processed and a set of silhouettes is
produced, it is not known which silhouettes match to the objects currently
being tracked. The object matching stage exploits the concept of feature
similarity to help achieve this match, and is described in section 3.3.
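An 8-connected labelling pass and the feature extraction described above might be sketched as follows. This uses a simple breadth-first flood fill for clarity; the real system uses a fast connected components algorithm, so the implementation below is purely illustrative:

```python
import numpy as np
from collections import deque

def extract_silhouettes(D, image):
    """Label 8-connected foreground regions of the binary mask D and
    return a feature vector f = [xc, yc, a, h, w, g] for each region."""
    H, W = D.shape
    labels = np.zeros((H, W), dtype=int)
    feats = []
    next_label = 0
    for sy in range(H):
        for sx in range(W):
            if D[sy, sx] == 0 or labels[sy, sx] != 0:
                continue
            next_label += 1
            labels[sy, sx] = next_label
            pixels = []
            q = deque([(sy, sx)])
            while q:  # breadth-first flood fill over the 8-neighbourhood
                y, x = q.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < H and 0 <= nx < W
                                and D[ny, nx] == 1 and labels[ny, nx] == 0):
                            labels[ny, nx] = next_label
                            q.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            a = len(pixels)                      # area
            h = max(ys) - min(ys) + 1            # MBR height
            w = max(xs) - min(xs) + 1            # MBR width
            xc, yc = sum(xs) / a, sum(ys) / a    # centroid (equation 3.2)
            g, _ = np.histogram([image[p] for p in pixels],
                                bins=16, range=(0, 256))
            feats.append([xc, yc, a, h, w, g])
    return feats
```

Each returned feature vector is exactly what the object tracking stage consumes: two regions joined only diagonally come back as a single silhouette, as the 8-connected definition requires.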

3.2.3 High-level feedback

The object tracker is designed with outdoor scenes in mind, particularly

streets and car parks and is designed to track both pedestrians and cars. The
tracking of pedestrians alighting from a vehicle must start as soon as they exit
the vehicle. If the vehicle is still part of the foreground, pedestrians emerging
from the car will be segmented together with the car as one large silhouette,
and will therefore be indistinguishable from the car. Also, once the car is
parked there is no longer any reason to track it. For these reasons, feedback
from the object tracker is used to insert into the background cars which are
deemed to have parked. If an object over the size Sstat has been stationary for
longer than time Tstat , then the area delimited by the object’s MBR from the
current frame is printed onto the reference image. In the final version, Sstat is
set to 2500 pixels and Tstat to 2.6 seconds. The parked car is now present on
the reference image and will no longer be segmented as foreground. Once this
is done, the internal object representing the car being tracked is removed.
A stationary object is defined as one which has not moved more than 1.5
pixels over the space of a frame (allowing a diagonal movement of (1, 1)). If
this condition does not hold for the whole Tstat seconds, it is not inserted
into the background. The stationary distance is not set to zero due to the
occasional poor segmentations and camera judder which causes the object
to move almost imperceptibly. The limit on the area of inserted objects is
to prevent stationary pedestrians, though rarely stationary for long enough,
from being inserted into the background. The Owens tracker relies on a
very similar method with one crucial difference. Objects subsumed into the
background are placed into a ‘recently inserted objects’ list. In the event of
a drive-off, where the vehicle leaves its parking spot soon after insertion, the
‘hole’ instantiates a new object. The centroid of this new object is compared

with the recently inserted object and if they match the hole is patched with
a recent record of the reference image taken when the object was originally
inserted. This allows for slightly faster patching of ‘holes’. However, the
‘drive off’ case never proved to be a serious issue in this implementation,
and the current patching system was found to be more than adequate.
It was noted early during the testing phase, when the area limit was not
present, that this high-level feedback is also very adept at removing patches of
noise created by lighting changes. The noise patches cause the object tracker
to instantiate an object which is tracked frame by frame. If the lighting
change persists, the object tracker is able to track this object from frame to
frame and it appears to be stationary. Once the area limit Sstat is introduced,
this beneficial effect is eliminated. To remedy this, a second and higher time
limit, Tstat2 ( 6 seconds) has been introduced for those objects which are
smaller than Sstat . The higher time limit prevents pedestrians from being
incorporated into the reference image due to their occasional movement, but
allows persistent noise to be eliminated.
This simple technique is extremely effective. The only noticeable prob-
lems occur when drivers leave their vehicles in a hurry, causing the pedestri-
ans to also be included in the background. Another minor problem occurs
when the car is included into the reference image when the passengers are
clearly visible through the car windows. In both these situations the pedes-
trian will move away from the vehicle and leave a ‘hole’ where the reference
image is now inaccurate. This is often corrected relatively quickly as the
system will treat this ‘hole’ as a stationary object and in turn incorporate
it into the background after Tstat2 seconds. A similar situation also occurs
when a parked car pulls away from its parked position and is corrected after
Tstat seconds.
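The insertion rule described in this subsection can be sketched as follows. The constants mirror the text (Sstat, Tstat, Tstat2, 1.5-pixel tolerance), but the function structure and names are illustrative assumptions, not the thesis's code.

```python
# Illustrative sketch of the high-level feedback rule: large stationary
# objects (cars) are inserted into the reference image after T_STAT
# seconds; smaller stationary regions (persistent noise) after T_STAT2.
S_STAT = 2500        # pixels: area threshold for the fast insertion rule
T_STAT = 2.6         # seconds: stationary time required for large objects
T_STAT2 = 6.0        # seconds: stationary time required for small objects
MAX_STEP = 1.5       # pixels: per-frame movement still counted as stationary

def should_insert(area, per_frame_steps, fps):
    """True if the object should be painted into the reference image.

    per_frame_steps: distances moved in each recent frame, newest last.
    """
    t_req = T_STAT if area >= S_STAT else T_STAT2
    needed = int(t_req * fps)
    if len(per_frame_steps) < needed:
        return False
    # every one of the last `needed` frames must show (almost) no movement
    return all(d <= MAX_STEP for d in per_frame_steps[-needed:])
```

A car of 3000 pixels that has barely moved for 2.6 seconds would be inserted, while a small noise patch must remain still for the full 6 seconds.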

3.3 Object Matching
This section describes the object matching part of the object tracking system.
For clarity, the overview provides a brief example of the problem the system
is designed to solve followed by a brief summary of the architecture of the
modular object matching pipeline. Subsequent sections examine each module
of the pipeline in further detail.

3.3.1 Overview

The result of the background differencing module is a group of silhouettes,
each with its associated feature vector. Based solely on this input informa-
tion, the object tracker’s task is twofold:

• Object creation and destruction: Silhouettes larger than a given
size that cannot be matched to existing objects instantiate new objects.
Objects deemed to have no feasible silhouette match are discarded.

• Object to silhouette matching: Objects from the previous frame
need to be matched to the silhouettes in the current frame, and updated
to reflect their new positions and features.

The first of the two tasks is relatively straightforward, assuming that
the second task is performed correctly. However, poor object to silhouette
matching can cause leftover silhouettes which instantiate as ‘extra’ objects,
and can also cause object tracking to be lost. Therefore the focus of attention
is on the second task.
Objects, just as with silhouettes, are associated with a feature vector
comprised of centroid, area in pixels, height, width and a 16-bin histogram.

Height and width are those of the Minimum Bounding Rectangle which
encloses it. When an object is first instantiated it inherits the feature vector
of the silhouette which caused its instantiation. In subsequent frames, the
tracker attempts to match the object to the silhouette(s) whose feature vec-
tors are most ‘similar’ to that of the object. In an ideal situation, a single
silhouette matches to a single object. In this case, a reasonable solution might
be to simply pick the silhouette which is closest to each object. However,
poor segmentation often causes this simplistic approach to fail. For exam-
ple, a silhouette of a pedestrian may become fragmented into two parts, in
which case the tracker needs to recognise that both silhouettes are part of
the same object. Similarly, two proximate objects may be segmented as a
single silhouette. This situation also needs to be recognised and the original
silhouette partitioned into two silhouettes. Both fragmentation and merging
can occur in a single frame, complicating the issue.
To illustrate the problems the tracker must solve, an instance of the prob-
lem is shown in figure 3.8.

Figure 3.8: An example of a challenging object-silhouette matching with
merging and fragmentation

In this example two pedestrians are being tracked – labelled A and B –
walking close together down the scene up to time t. Each object is enclosed
by an MBR highlighting where the object tracker believes the objects to be. The

white lines on the leftmost image highlight the path taken by the pedestri-
ans. At time t + 1, background differencing produces three silhouettes. The
tracker must correctly identify which silhouettes belong to which objects. In
this case, silhouette 1 belongs both to object A and B, silhouette 2 belongs
only to object A, and silhouette 3 belongs to object B. Once the correct
object to silhouette match has been ascertained, the tracking algorithm may
have to merge and partition silhouettes – a result of which can be seen in
the rightmost image of the above figure. The darker silhouette is allocated
to object A, and the lighter to object B. What follows in the rest of this sub-
section is an overview of how the tracker achieves this, followed by a detailed
look at each component of the tracking algorithm.
The first step of the object tracking algorithm is to measure the distance
between the expected centroid of every object, Qj , and the centroid of every
silhouette, Si . This is used to produce a valid match matrix V . A silhouette to
object match is only valid if the distance between them is below a search ra-
dius r. In the final design r = 150, where distance is measured in pixels. This
valid match matrix defines which silhouette-object matches are permitted.
This helps to safely reduce the number of potential matches and therefore
eases the processing requirements in later stages of the algorithm.
V is an n × m matrix, where n is the number of objects and m is the number
of silhouettes:

          S0   S1   ...  Sm
    Q0     0    1   ...   0
V = Q1     1    1   ...   1
    ...   ...  ...  ...  ...
    Qn     0    0   ...   1

where

    Vji = 1 if ||qj − si || ≤ r, and 0 otherwise,

and where qj and si are the object and silhouette positions on the image
plane, respectively.
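Building V can be sketched as follows, assuming centroids are given as (x, y) pairs; the function name is illustrative, and only the radius r = 150 comes from the text.

```python
# Minimal sketch of the valid-match matrix V: V[j][i] = 1 iff
# silhouette i lies within the search radius r of object j.
import math

R = 150  # search radius in pixels, as in the final design

def valid_match_matrix(objects, silhouettes, r=R):
    """objects, silhouettes: lists of (x, y) centroids."""
    return [[1 if math.dist(q, s) <= r else 0 for s in silhouettes]
            for q in objects]
```

Only pairs with Vji = 1 need be considered by the later search, which is what reduces the processing requirements.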
The valid match matrix merely indicates potential object-silhouette matches.
To represent a concrete matching between object and silhouettes, the n × m
match matrix, M , is introduced, for which an entry of ‘1’ in the table repre-
sents a match between an object and silhouette. An object, Qj , may only be
matched to a silhouette, Si , if Vji = 1. The object-silhouette matching can
now be visualised as a finite search space. Each potential object-silhouette
match in the valid matrix V can be set to either ‘1’ or ‘0’ in the match
matrix. The search algorithm, described in section 3.3.5, searches through
this search space of all possible silhouette-object matchings to find the lowest
cost match matrix allocation.
Before the cost of a given match matrix object-silhouette matching can be
calculated, some silhouettes may need partitioning and/or merging. This is
performed by the conflict resolution module. This module resolves conflicts
where several objects are matched to a single silhouette or vice versa, by
merging and partitioning silhouettes. The result of this operation is a one to
one correspondence between silhouettes and objects. It is worth noting that
objects may be matched to no silhouette, and silhouettes may similarly be
left unmatched.
The cost of a silhouette to object match is calculated by a collection of
Self-Organising Map (SOM) neural networks which have previously been trained
on hand-marked (‘reference standard’) sequences of video. The SOMs have
learned to quantify the novelty of a silhouette to object match, such that an
unusual matching (such as a very large difference in area or width) produces
a high cost compared to relatively normal ones. The costs of all the object
to silhouette matches are added together to produce a global cost – which
defines the global cost of a given match matrix M . Added to the global cost
is a cost for each unmatched object and a small cost for each new object
created. Further details of the cost function are provided in section 3.3.2.
With the lowest cost match matrix found, each object’s feature vector is
replaced with the features of the silhouette with which it was associated by the
conflict resolution module. In addition to the feature vector, each object also
maintains a history of the path it has taken through the scene. Obviously,
this is updated with the latest centroid.
Each object is then classified as either a car or pedestrian using a multi-
layer perceptron neural network – trained using examples of pedestrian and
car features obtained from the reference standards. This classification task
is covered in more detail in section 3.4.
The following diagram is an overview of the system architecture:

Figure 3.9: A breakdown of the object tracker algorithm

This diagram is presented here in a slightly simplified format for clarity.
For example, the partitioning module makes heavy use of the costing func-
tion. The search algorithm iterates over a subset of the search space of all
the different allocations of M , using the cost function to guide the search for
the best allocation.
This modular approach to system design enabled the independent de-
velopment and statistical analysis of the performance of each module. An
analysis of the performance of individual modules and of the system as a
whole is available in chapter 4. The search algorithm and cost function mod-
ules play the pivotal role in the performance of the system.

3.3.2 Cost Function

The role of the cost function is to provide a quantitative measure of the
likelihood of a silhouette and an object being a correct match. This section
describes two cost functions. The first, referred to in this document as the
Owens cost function due to its use in the Owens tracking algorithm[20], relies
on a measurement of difference between two feature vectors. The second
function uses neural methods to learn the typical values of good matches,
and seeks to identify how good a match is on the basis of its accumulated
‘experience’. The first cost function is introduced as a natural precursor to
the second, and is later used in this thesis for comparative purposes in the
assessment of the quality of the second. Finally, the concept of a ‘global
cost function’ is introduced, in which the cost of an entire scene matching
is defined.
Associated with each object and silhouette is a feature vector f = [xc , yc , a, h, w, g],
representing centroid position, area, height, width and a 16-bin intensity his-
togram. In addition, each object also has a history of the path it has taken
over its lifetime (centroids) and a speed. The speed of each object is simply
the distance travelled between the last two frames, and is expressed as a
(dX,dY) vector. The overall assumption of the Owens cost function is that
the features of an object will change little over the space of a single frame,
therefore silhouettes whose features are a close match should have a lower
cost. The concept of ‘difference’ is central to the cost function. The ‘differ-
ence’ of the four last features – area, height, width and histogram – of the
feature vector are calculated prior to the final cost as shown in the equations
below. The positions (centroids) of the silhouette and object play no role in
this cost function. For a given object Q and silhouette S, four variables are
calculated:

    pArea = |aQ − aS | / aQ                                        (3.3)

    pHeight = |hQ − hS | / hQ                                      (3.4)

    pWidth = |wQ − wS | / wQ                                       (3.5)

The histogram difference is calculated as the sum of the squared differ-
ences between each bin, and the result of this sum is square rooted:

    dHist = √( Σ (uk − vk )² ),  k = 1 . . . 16                    (3.6)

where uk and vk are the object and silhouette histograms’ k th bins, respec-
tively. Each feature vector difference, excepting the histogram difference, is the
difference scaled by the object’s feature value1 . The Owens cost function is
¹These calculations are adjusted slightly in the case of macro-object comparison, as
described in section 3.3.5.

simply the sum of these vector differences, as shown in equation 3.7.

co (Q, S) = pArea + pHeight + pW idth + dHist (3.7)

This gives a basic measure of the similarity of the features of a silhouette

and an object. The weakness of this method is that it naively assumes that
all features should play an equal role in the overall cost. This may be
false, though the variances of at least the first three are likely to be
similar, given that they are scaled by their parent object’s value.
Informal testing using a system of weighted inputs appears to have little or
no impact on performance, and the multitude of possible combinations makes
thorough testing a near impossibility. Though assuming the invariance of the
object features from frame to frame seems a sensible option, this assumption
does break down in several situations. For example objects may enter an area
of shadow, radically changing their histogram values. An object travelling
away from the camera at high speed will tend to shrink in area. Objects
entering the scene will initially have a small area and tend to grow very
rapidly as more of their body enters the visible area.
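Equations 3.3 to 3.7 can be sketched as follows. This is an illustrative Python version; the dict-based feature representation is an assumption, and the use of absolute differences (so that the cost is non-negative) is my reading of "difference scaled by the object's feature value".

```python
# Hedged sketch of the Owens cost function (equations 3.3-3.7):
# scaled absolute differences in area, height and width, plus the
# Euclidean distance between the 16-bin histograms.
import math

def owens_cost(obj, sil):
    """obj, sil: dicts with area 'a', height 'h', width 'w', histogram 'g'."""
    p_area   = abs(obj["a"] - sil["a"]) / obj["a"]    # (3.3)
    p_height = abs(obj["h"] - sil["h"]) / obj["h"]    # (3.4)
    p_width  = abs(obj["w"] - sil["w"]) / obj["w"]    # (3.5)
    d_hist = math.sqrt(sum((u - v) ** 2               # (3.6)
                           for u, v in zip(obj["g"], sil["g"])))
    return p_area + p_height + p_width + d_hist       # (3.7)
```

A close match in all four features yields a cost near zero, which is what the tracker favours.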
In an attempt to improve the performance of the cost function, a radically
different approach was considered. Returning to the original statement of
the role of the cost function, it seeks to quantitatively measure the likelihood
of a match being the correct one. The following technique borrows from
techniques used in the detection of suspicious pedestrian behaviour. Many
of these systems track the centroids of pedestrians across the scene. A large
number of examples of ordinary (non-suspicious) paths taken by pedestrians
across the scene are submitted to the system, and the system learns some
of the essential characteristics of ordinary paths through the scene. Such
characteristics may be a particular speed, or that a pedestrian never crosses

a particular location (such as an area covered with thick bushes), or that
travelling in a certain direction is extremely rare (such as entering through
an exit). A new case submitted to the algorithm is classified as unusual if it
is not sufficiently similar to the training data. In other words, the algorithm
detects as suspicious any activity which it considers to be novel – this is
often considered to be close enough to the definition of ‘suspicious’ to qualify
as a measure. These algorithms are able to pick up on surprisingly subtle
changes in direction and activity such as a pedestrian dropping his/her keys
and pausing to pick them up. A similar technique was employed to find the
cost of matching silhouettes to objects. By submitting a large number of
hand-matched training cases to the algorithm, the system could learn what
‘normal’ matchings look like. The cost is then some measure of the similarity
between the training cases and the new case.
In order to achieve this ‘novelty detection’ the algorithm makes use of
the Self-Organising Map (SOM) neural network, originally proposed by Teuvo
Kohonen. This is a form of unsupervised learning – as opposed to super-
vised. In supervised learning, a neural network is trained using a number
of examples consisting of inputs and their associated outputs. The network
attempts to learn the correlation between inputs and output in order to be
able to predict the correct output when a new case is submitted to it. Un-
supervised learning, on the other hand, is trained using only the inputs.
The system learns the underlying structure of the data, rather than the re-
lationship between inputs and outputs. Once the system has learned the
underlying structure, when new data is submitted to the SOM a quantita-
tive measure of how closely the data fits the structure of the training data
is produced.
A self-organising map is an n-dimensional map of neurons, in this case a
2-D sheet of 40 × 40 prototype neurons. The map is topologically ordered –
neurons which are close together tend to map similar features. The input
nodes can be visualised as lying beneath the plane, and are fully connected to
the SOM prototype neurons. The SOM is trained by changing the weights
of the connections between the inputs and the prototype neurons. This
changing of the weights can also be visualised as the distortion of the 2-D
sheet into d dimensional feature space, in order to reduce the dimensionality
from d dimensions to only 2, and map the structure of the data. Viewed in
this way, the activity level of a prototype neuron is effectively the distance
between the input feature vector and prototype’s position in d dimensional
feature space.
Given an input vector X, the activity level of a prototype neuron i with
weight vector ωi is defined as:

ai = |ωi − X| (3.8)

In order to quantify the novelty of a new input case, the activity levels
of each prototype neuron are calculated. The winning neuron is defined as
that which has the lowest activation level. The measure of novelty is simply
the level of activation of the winning neuron. Therefore input vectors that
are far (in feature space) from any of the trained prototype neurons yield a
high value and are considered to be ‘unusual’.
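The novelty measure of equation 3.8 can be sketched as follows, with the 40 × 40 map flattened into a (1600, d) weight matrix; the function name is illustrative.

```python
# Sketch of the SOM novelty measure: the activation a_i = |w_i - X| of
# every prototype neuron is computed, and the novelty is the activation
# of the winning (lowest-activation) neuron.
import numpy as np

def novelty(weights, x):
    """weights: (n_prototypes, d) array; x: input vector of length d."""
    activations = np.linalg.norm(weights - x, axis=1)   # a_i = |w_i - X|
    return activations.min()                            # winning neuron
```

Inputs lying far from every trained prototype therefore return a large value and are flagged as unusual.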
All three SOMs used in the cost function were trained in an identical way.
The prototype weights are initialised to Gaussian random values. Training
is divided into two phases: the first to develop the coarse structure, and the
second for fine-tuning of individual neurons. The values used are in table
3.1. The learning rate is denoted by α, and the size of the neighbourhood
function window is denoted by w. The values for α and w, shown in table
3.1, are start-to-end values for that phase.

            Phase 1    Phase 2
  Epochs      100       1000
  α         0.1–0.02   0.1–0.01
  w           3–1        0–0

Table 3.1: The 2-phase training of the SOMs. The first phase learns the
coarse structure, whilst the second fine-tunes the individual neurons
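The two-phase schedule can be sketched as a training loop like the one below. Only the phase parameters come from table 3.1; the linear decay of α and w, the Chebyshev neighbourhood on the map grid, and all names are assumptions for illustration.

```python
# Illustrative two-phase SOM training loop. Each epoch presents the data
# in random order; the winning prototype and its map neighbours are
# pulled towards each sample, with learning rate and neighbourhood
# radius decaying linearly over the phase.
import numpy as np

def train_som(weights, grid, data, epochs, a0, a1, w0, w1, rng):
    """weights: (n, d) prototypes; grid: (n, 2) map coordinates."""
    for e in range(epochs):
        frac = e / max(epochs - 1, 1)
        alpha = a0 + (a1 - a0) * frac          # learning-rate decay
        radius = w0 + (w1 - w0) * frac         # neighbourhood decay
        for x in rng.permutation(data):
            win = np.argmin(np.linalg.norm(weights - x, axis=1))
            # neurons within `radius` (Chebyshev) of the winner move
            near = np.abs(grid - grid[win]).max(axis=1) <= radius
            weights[near] += alpha * (x - weights[near])
    return weights

# Phase 1 (coarse structure) then phase 2 (fine-tuning), as in table 3.1:
# train_som(w, g, data, 100, 0.1, 0.02, 3, 1, rng)
# train_som(w, g, data, 1000, 0.1, 0.01, 0, 0, rng)
```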

In order to train the SOMs, large amounts of data have to be collected
from the reference standard. The reference standard, described in further de-
tail in Chapter 4, essentially consists of hand-picked match matrices yielding
the best possible matching between objects and silhouettes. Three sequences
of roughly 90 minutes each were hand-marked to create it, resulting in 75200
frames of data. The first sequence is used only to train the SOMs, so that
testing could be carried out on the other two sequences. Before this can begin,
the data for each object to silhouette match in the sequence has to be ex-
tracted into a usable format (as a comma separated file) from the reference
standard. In total, 16 variables are extracted. These serve to train 3 indepen-
dent SOMs: the Motion SOM, the Comparative SOM, and the Appearance
SOM. The inputs of each of the SOMs will now be described, together with
the overall function of each SOM. The final object to silhouette cost is simply
the sum of all 3 SOMs’ measures of novelty. A more detailed analysis of the
performance of each SOM, and an overall analysis of collective performance,
is presented in Chapter 4.
The role of the Motion SOM is to quantify the novelty of an object’s
motion, assuming it is about to be matched to a given silhouette. This SOM
is very similar to the SOM used by Hunter et al.[11] to perform local motion
analysis, in order to classify the local motion as either usual or unusual
– and therefore identify suspicious activity. Here, its quantification of the
normality of the trajectory is used to contribute a measure of likelihood of
an object-silhouette matching being correct. In total, there are 8 inputs to
the motion SOM. The first four [x, y, dx, dy] represent what the object’s
position and speed will be, assuming that the object-silhouette matching is
correct. These values are taken from the silhouette, since the object’s latest
position will be updated using the new silhouette’s data. To clarify the
mathematical notation and prevent confusion, it is easier to assume that the
object path has already been updated by the silhouette position. Therefore,
the latest position at time t of the object, denoted (xt , yt ), is equivalent to
the silhouette centroid. The values of dxt and dyt are simply calculated as the
difference in position from time t − 1 to time t:

dxt = xt − xt−1 (3.9)

dyt = yt − yt−1 (3.10)

Drawing on inspiration from the functioning of the brain, many pattern-
matching algorithms try to provide some form of context within which the
novelty of the match can be assessed. Using only the first four inputs, it
would be impossible to account for sudden changes in the speed vector of
the object. Short-term memory can be used to provide a form of context, in
the shape of an averaged window of recent values. For these reasons, four
additional inputs are included [w(x), w(y), w(dx), w(dy)]. The function w(x)
is a moving average, with a window size n, and is defined as follows:

    wt (x) = (1/n) xt + ((n − 1)/n) wt−1 (x)                       (3.11)
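Assembling the eight motion-SOM inputs can be sketched as follows. This assumes, as the text does, that the object's path has already been updated with the candidate silhouette centroid, and that the window of equation 3.11 is applied recursively; the function name and state handling are illustrative.

```python
# Sketch of the 8 motion-SOM inputs: the raw [x, y, dx, dy] from the
# (updated) path, plus the moving-average context [w(x), w(y), w(dx), w(dy)].
def motion_inputs(path, w_prev, n=4):
    """path: list of (x, y) centroids, newest last; w_prev: previous
    [w(x), w(y), w(dx), w(dy)] context. Returns (inputs, new context)."""
    (x1, y1), (x0, y0) = path[-1], path[-2]
    raw = [x1, y1, x1 - x0, y1 - y0]               # x, y, dx, dy
    # w_t(v) = (1/n) v_t + ((n-1)/n) w_{t-1}(v)    -- equation 3.11
    w_new = [v / n + (n - 1) / n * wp for v, wp in zip(raw, w_prev)]
    return raw + w_new, w_new
```

The returned context is carried forward to the next frame, giving the SOM its short-term memory.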

The first four inputs are subjected to this averaging window to provide a
further four inputs representing the short term memory. Therefore, there are
only 4 ‘original’ inputs into the system, with a further 4 inputs to provide
context. The network is trained on reference standard hand-marked paths
of objects through the scene, and therefore learns to model the normal posi-
tions and motion of objects throughout a given scene. When a new case is
presented to the SOM, it produces a measure of how closely the new case can
be matched to previously seen (training) data. Unusual object position and
speeds, such as crossing the car park at an unusual angle or performing an
instantaneous 180 degree change in direction of motion, should incur a high
cost. The implicit assumption of this technique is that the likelihood of an
object to silhouette match being correct is directly related to how unusual
it is. At first glance this could present a problem. This technique borrows
from the field of the detection of suspicious activity, often measured as how
‘unusual’ the motion of an object is using similar techniques to those used
in this SOM. Therefore, if a pedestrian were to act ‘suspiciously’, the correct
match could lead to a high cost as the system believes this behaviour to be
unlikely. However, the extra costs incurred from suspicious behaviour are
relatively small (the thresholds of detectors of suspicious activity are rela-
tively low), and for incorrect tracking to occur a ‘more likely’ scenario would
have to be available. This ‘more likely’ scenario would have to be in the
form of a better matching silhouette in the close vicinity. In such a case,
the offending silhouette would have to have similar area, height, width and
histogram to match the requirements of the two other SOMs which will be
presented later. These factors conspire to make such an incorrect match
due to suspicious behaviour very unlikely. The network was trained on data
containing only ‘normal’ behaviour, therefore making suspicious behaviour
stand out by definition. If the tracking of suspicious pedestrians was truly
believed to have been compromised it could be trained on data which in-
cluded suspicious activity, thereby removing the element of it being unusual
and removing any possible negative effects. Given that this is an issue which
could have negatively affected performance, it is dealt with more closely in
Chapter 4.
The second SOM, known as the Comparative SOM, is designed essentially
to replicate the role of the Owens cost function. It attempts to quantify the
‘difference’ in features of the object and silhouette. The first four inputs are
identical to the Motion SOM: [x, y, dx, dy]. The last four are identical to the
four elements of the Owens cost function: [pArea, pHeight, pWidth, dHist].
In this SOM, the first four inputs can be considered to be the ‘context’. They
do not represent short-term memory, but instead allow the system to react
differently in different areas of the scene and under different conditions of
motion. It is this ‘context’ which provides an advantage over a more basic
approach such as the Owens cost function. For example, as an object at
the edge of screen enters the scene it will tend to grow quickly in size as
it becomes increasingly visible. The SOM models this effect and will tend
to favour matchings growing in area, where the object is at the edge of the
screen heading in to the scene. The same is also true of pedestrians who,
when they exit their car, are often obscured by their own vehicle or another
parked car. At specific areas of the scene the pedestrians tend to ‘grow in
size’ as they emerge from behind the cars. Similar effects are true with the
other three features: pHeight, pWidth, dHist. Again, these are examined
more closely in the statistical analysis in chapter 4.
The third SOM is the Appearance SOM. As its name suggests, its role is to
assess whether the basic appearance of an object is close to the norm seen
in the training cases. The four features used are: area, aspect ratio, height and
width. As before, the system assumes the current object silhouette match
is correct and the features would be inherited from the silhouette in such
a case. Therefore, the features are lifted from the silhouette. Aspect ratio,
AR, is calculated as follows:

    AR = Height / Width                                            (3.12)
In addition to these four inputs, the centroid (x, y) is also included in
the inputs to again enable the SOM to learn the ‘normal’ object appearance
within the context of its position in the scene. Initially, this may seem
unnecessary as the appearance of an object should vary little over a scene.
However, the average area of an object varies considerably depending on its
proximity to the camera (its ‘y’ position). Similarly to earlier examples, an
object’s appearance is heavily influenced by occlusion – either due to being
partially outside the viewing frustum (off the edge of the screen), or occluded
by scene objects such as parked cars. When an object enters the scene from
the center of the bottom of the screen, its area, aspect ratio and height may
all be affected. Its width, however, will remain unaffected. The SOM learns
these patterns of appearance that are highly dependent on the position of
objects, helping it to produce a more accurate estimate of the ‘novelty’ of an
object’s appearance.
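Assembling the six Appearance-SOM inputs is straightforward and can be sketched as below; the function name is illustrative, and only the input list and equation 3.12 come from the text.

```python
# Sketch of the Appearance-SOM input vector: centroid position for
# context, plus area, aspect ratio (equation 3.12), height and width.
def appearance_inputs(xc, yc, area, height, width):
    ar = height / width          # AR = Height / Width  (3.12)
    return [xc, yc, area, ar, height, width]
```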
For clarity, the inputs of all three SOMs are summarised in table 3.2.
The outputs from all three of these SOMs are added together to produce
the final object to silhouette match cost.

SOM           Inputs
Motion        X, Y, dX, dY, w(X), w(Y), w(dX), w(dY)
Comparative   X, Y, dX, dY, pArea, pHeight, pWidth, dHist
Appearance    X, Y, area, AR, height, width

Table 3.2: A tabular summary of the 3 SOMs and their inputs

There are two types of object this tracker is designed to track – vehicles
and pedestrians – which differ greatly in features and motion. With a single
SOM set, these two types of object will generate different cluster centres –
with perhaps a slight overlap where similarities lie. In this way, the tracker is
able to track both types of objects fairly robustly with the same set of SOMs.
However, by creating two separate sets of SOMs – one for each type of object –
each set is able to specialise in tracking that form of object without having to
also generalise for the other object type. It could be argued that it increases
the modelling power of the cost function, whilst introducing no additional
computational burden to the system (only one of the two SOMs is run for
each cost assessment). It also eliminates any chances of overlapping features
from frame to frame. In the single SOM set scenario, if in the current frame
a pedestrian has been matched to a silhouette which would imply movement
that is similar to that of a car, the motion could be considered normal as
it closely matches a cluster centre for vehicle motion. This situation can
be avoided in the two SOM set design. In order to test these underlying
hypotheses, both designs were tested using the methods described in Chapter
4. In the single SOM case, 98.49% of all object matches were within 5 pixels
with an average match distance of 0.35 pixels. This compares with 99.27% of
matches within 5 pixels, and an average match distance of 0.22 pixels for the
dual SOM set design. Therefore, the final design is based on having separate
SOM sets for vehicles and pedestrians. Each set is able to specialise to
represent the motion, differences and appearance of its target object type by
being trained on examples of the target type only. In the reference standard,
the type of object being tracked is recorded alongside the other main features
to make this discrimination possible. For this dual SOM set technique to
be effective during live tracking, the type of object being tracked must be
accurately determined to ensure that the correct set of SOMs is run. This
dependence on the foreknowledge of object type greatly increases the need
for high accuracy in the object classification module described in section 3.4.
So far, this chapter has focused on finding the cost of a single object
to silhouette match. An important concept in the design on this tracking
system is that of a global cost. That is, the cost of a given match matrix
and the resulting object silhouette matches. To find the best match matrix,
the system generally searches for the lowest global cost – though the search
function does make use of the single cost function to search heuristically.
Using this global cost function forces the objects to ‘compete’ against one
another for their best matching silhouettes. An improvement in cost for one
match is not permitted if it causes a larger cost increase for another match.
Given only a cursory glance, constructing a global cost function appears
to be trivial: the global cost could simply be the sum of all the object to
silhouette matching costs. However, this approach ignores the possibility of
new objects appearing and current objects remaining unmatched and dis-
appearing. Using this simple global cost model, all objects would remain
unmatched since the cost of a match is always larger than zero – the cost
of having no match would implicitly be zero. To address this issue, a cost,
γ, is added for each object which remains unmatched in a given match ma-
trix. The value of γ needs to be high enough to prevent a correctly matched
object from becoming unmatched, yet low enough to prevent objects which
have left the scene from being matched to an incorrect silhouette. One might
assume that objects only leave the scene at the edges of the screen, and that

this simple fact could be used to prevent objects in the middle of the scene
being incorrectly unmatched and lost. However, pedestrians are sometimes
occluded by cars, buildings and other static objects. Besides making objects
disappear behind them, heavy occlusions – though rare – can produce extremely
poor results when objects are partitioned. In these situations, it is often best to
simply lose tracking until the object re-emerges and can be reacquired by
the tracker, rather than allow the tracker to continue with extremely poor
tracking. For these reasons, the tracker does not make use of the proximity
of objects to the edge of the screen when deciding if objects should be un-
matched, and instead simply relies on the cost of the object match. In the
majority of cases when an object reaches the edge of the screen and passes
out of sight, the lack of a silhouette within a valid radius forces it to remain
unmatched and tracking to be lost. The concept of a ceiling to the cost of
an object match (γ) is only needed in those situations where other object
silhouettes or transient noise silhouettes could be matched to them.
The value of γ was initially set purely ‘by eye’ in experimentation. Later
developments in the reference standard statistical techniques allow for a more
scientific approach. As a basic model, γ should be higher than any correct
match cost and lower than any incorrect match cost. This assumes the
tracker uses a perfect cost function in which ‘correct’ and ‘incorrect’ match
costs are linearly separable – which is not the case. To gain an impression
of where the optimum value of γ lies, the costs of reference standard (and
therefore ‘correct’) matches are noted for a 1.5 hour sequence and plotted
as a histogram. In total, 7078 match costs are recorded ranging from 0 to a
cost of 25.8. Despite the apparently large spread, 99.4% of the recorded costs
are between 0 and 2.0. All of the outliers with a cost of over 3.0 were found
to be cars entering the scene, presumably producing changes which have not

been seen during the training phases. A logarithmic histogram with a bin
width of 0.1 is shown in figure 3.10.

Figure 3.10: A histogram of the costs of reference (‘correct’) matches, to aid

the choice of value of the γ cost variable
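The binning used for this analysis can be sketched in a few lines; the function names and sample values below are purely illustrative, not the thesis implementation.

```python
from collections import Counter

def cost_histogram(costs, bin_width=0.1):
    """Bin match costs into fixed-width bins, as for figure 3.10
    (bin index i covers the interval [i*bin_width, (i+1)*bin_width))."""
    bins = Counter(int(c / bin_width) for c in costs)
    return dict(sorted(bins.items()))

def fraction_at_or_below(costs, threshold):
    """Fraction of match costs at or below a threshold, e.g. the
    observation that 99.4% of recorded costs lie between 0 and 2.0."""
    return sum(1 for c in costs if c <= threshold) / len(costs)
```

A logarithmic display of the counts, as in figure 3.10, is then only a plotting detail.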

Though this analysis helped to narrow the search for the optimum value
of γ, it should be borne in mind that the concept of correct and incorrect
matches being linearly separable is somewhat inaccurate. Therefore, with
this ‘ballpark’ figure of placing γ between 0.5 and 3.0, a series of tests was
conducted using the reference standard to provide statistics. These analy-
ses, described in Chapter 4, provide the number of ‘lost’ objects and ‘extra’
objects. By ‘lost’ objects, the author refers to the situation where an object
remains unmatched and tracking is lost when it should have been matched to
a silhouette. ‘Extra’ objects are objects which were instantiated where none
should have been, and are the result of incorrect silhouette matches when an
object should have been removed.
A large value of γ will cause a rise in unmatched objects (‘lost’ objects);
a low value will cause a larger number of ‘extra’ objects. The tracking algo-

rithm was run several times on a 1.5 hour reference standard-marked sequence
with different values of γ. The number of ‘extra’ and ‘lost’ objects for each
value was noted. The definition for what constitutes the ‘best’ compromise
between ‘extra’ and ‘lost’ objects is debatable, and was simply the judge-
ment of the author. Setting γ to a value of 1.5 appears to produce the best
compromise. The results of these tests are briefly summarised in the graph
in figure 3.11.

Figure 3.11: The effect of different values of γ on the number of ‘extra’ and
‘lost’ objects
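The sweep over candidate values of γ might be sketched as follows; `evaluate` is a hypothetical stand-in for running the tracker on the reference-marked sequence, the toy numbers are purely illustrative, and minimising the sum of the two error counts is only one possible compromise (the thesis leaves the final judgement to the author).

```python
def choose_gamma(candidates, evaluate):
    """Sweep candidate gamma values and pick the best compromise between
    'lost' and 'extra' objects. `evaluate` returns a (lost, extra) pair
    for a given gamma; here the compromise minimises their sum."""
    results = {g: evaluate(g) for g in candidates}
    best = min(results, key=lambda g: sum(results[g]))
    return best, results

# Toy stand-in for the real evaluation (illustrative numbers only):
# a high gamma inflates 'lost' counts, a low gamma inflates 'extra' counts.
toy_evaluate = lambda g: (int(g * g * 10), int((3.0 - g) ** 2 * 10))
best, _ = choose_gamma([0.5, 1.0, 1.5, 2.0, 2.5], toy_evaluate)
```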

A second (small) global cost is also added for creating new objects, de-
noted by δ. As a pedestrian is tracked across the screen it may be poorly
segmented and fragment into 2 large silhouettes. Ideally, the tracker should
recognise this and match both silhouettes to the object. However, occasion-
ally, the tracker matches only a single silhouette to the object, leaving a
large unmatched silhouette. The tracker’s policy is to instantiate large un-
matched silhouettes into new objects, and the end result is a poor match
for the current object and a new unwanted object. As will be covered in

detail in Chapter 4, both ‘extra’ and ‘lost’ objects are classified as such in
relation to a parent reference standard object in the previous frame. The
new object created in this example will have no such parent object, and as
such is referred to as an ‘orphan’. The role of δ is to reduce the number of
‘orphans’ in a sequence. In order to softly dissuade the tracker from leaving
large unmatched silhouettes near the current object, a cost of δ is added for
every object that is created on a given frame. Again, statistical techniques
were used to place bounds on where the ‘optimal’ value might lie. As the
value of δ increases, the number of ‘orphans’ decreases. The higher the cost,
the more likely an object is to match to nearby large silhouettes. This has the
effect of ‘soaking up’ all nearby silhouettes large enough to
instantiate a new object. If δ is set too high, objects can match to silhouettes
generated by noise or silhouettes which should genuinely instantiate new ob-
jects. Therefore, as δ rises in value the number of ‘orphans’ decreases at the
expense of match quality and a loss of new objects. Statistical techniques
provide guidance when setting the value, but the final choice once again rests
with the author’s judgement. Testing on various complex scenes helped set
the final value at 0.3.
It should be noted that these two variables are set relative to the typical
values that are output from the cost function. Above, the cost function under
consideration was the SOM-based system – as used in the final design. In
order to test the quality of this cost function, its performance is directly
compared with that obtained using the Owens cost function. When using
the Owens cost function, the two cost variables have to be switched to
values believed to be optimised for that function – to ensure fair testing.
The final values for the Owens cost function are 5.0 for γ and 1.0 for δ.

The final global cost, g, of a given match matrix, M , is therefore the sum
of all object-silhouette matches, added to which are the costs of unmatched
and new objects:

g(M ) = Σi=1..n c(Qi , Si ) + uγ + cδ (3.13)

where u and c are the numbers of unmatched and newly created objects
respectively, and n is the number of matched objects. Si is the silhouette
matched to object Qi , once any potential conflicts (merges and partitions)
have been resolved by the conflict resolution module.
In the above formula, all three matching possibilities that an object can
find itself in incur a cost: the object is matched to a silhouette; it is un-
matched and removed; a new object is created.
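Equation 3.13 reduces to a one-line function. This is a minimal sketch using the final γ and δ values quoted in the text; the per-match costs are assumed to have been computed already by the SOM cost function.

```python
GAMMA = 1.5  # cost per unmatched object (final value from the text)
DELTA = 0.3  # cost per newly created object (final value from the text)

def global_cost(match_costs, n_unmatched, n_created,
                gamma=GAMMA, delta=DELTA):
    """Equation 3.13: the sum of the individual object-silhouette match
    costs, plus gamma for every unmatched object and delta for every
    newly created object in the candidate match matrix."""
    return sum(match_costs) + n_unmatched * gamma + n_created * delta
```

For example, two matches costing 0.2 and 0.4, with one unmatched and one newly created object, give a global cost of 0.2 + 0.4 + 1.5 + 0.3 = 2.4.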
In the development of this technique for producing a global cost, many
(generally far more complex) cost functions were devised. These involved
different values for γ and δ depending on whether the object was at the edge
of the screen or not. It also involved adding costs for merging silhouette
based on the distance between the silhouettes in an attempt to discourage
the merging of distant silhouettes. The length of time for which an object
has been instantiated was also used as a measure of how likely the object
was to be a ‘true’ object, and therefore the cost of leaving a short-lived
object unmatched was lower than that of other objects. All of these extra
techniques were found to have no advantage, or to even be detrimental to
the performance of the algorithm when tested statistically using the reference
standard data and were therefore removed for simplicity.
Looking at the design of the SOM cost function for the single object
to silhouette match, one might well ask why it is divided into 3 separate
SOMs. There is replication of the inputs and apparently these inputs could

all be combined into a single SOM to reduce overheads. The reason for the
functional division of the 16 inputs into 3 distinct SOMs is a question of what
is sometimes referred to as the ‘curse of dimensionality’. As dimensionality
of the input vector increases, the size of the space that must be modelled
increases exponentially. In this case, a 2-dimensional SOM simply cannot
feasibly model a 16-dimensional input space with accuracy – despite the
intrinsic dimensionality of the input vector being considerably lower than 16
dimensions. Several different arrangements were tested, however, with the
largest being a single 14-input SOM. Predictably, the results from this single
SOM were disappointing and an attempt was made to keep the number of
inputs as low as possible for each SOM. It is possible that this particular split
of the SOMs reflects an underlying independence of the variables across the
three SOMs.
Another possibility for modelling the cost function is the use of a k-means
clustering algorithm instead of a SOM. In many situations, the k-means
algorithm can perform the same ‘novelty detection’ kind of functionality that
SOMs provide. Unfortunately, this option could not be explored due to time
constraints. It is likely that a very similar performance would have been
obtained from replacing the SOM methods with k-means. There is no clear
theoretical advantage of using one technique over another.
This section has presented both the local object-silhouette match cost
function and the global cost function. Clearly, for the SOM cost function to
be effective it must be trained on the same scene as it is required to track
on, but it offers significant performance advantages over a simpler metric.
Chapter 4 compares the performance of the SOM cost function with simpler
cost metrics to reinforce this claim.
The concept of an object to silhouette match cost is central to the search

for the ‘best’ match matrix in the Search Function module. For an object to
silhouette match cost to be assessed, the matching must be one-to-one. This
is not always the case, as the match matrix may assign several silhouettes
to a single object and several objects to a single silhouette. Therefore, the
conflict resolution step is required to resolve a given match matrix into one-
to-one matchings between objects and silhouettes. To do this, silhouettes
must be merged and partitioned as described in the following section.

3.3.3 Object Merging & Partitioning

This section deals with merging – where several silhouettes are joined to form
a single silhouette – and partitioning – where a single silhouette is split into
several silhouettes. The first of these two tasks is relatively trivial.
Each silhouette is associated with a feature vector f = [xc , yc , a, h, w, g].
This consists of the centroid (x and y), area, height and width (of the MBR),
and a 16-bin normalised histogram. To calculate the features of the merged
silhouette resulting from silhouettes S1 and S2 , a sum – weighted by area – is
performed. Sm is the silhouette resulting from the merge, and each feature
is subscripted with its relevant parent silhouette. The new centroid and area
are calculated as detailed in equations 3.14, 3.15 and 3.16.

am = a1 + a2 (3.14)

xcm = (xc1 a1 + xc2 a2 ) / am (3.15)

ycm = (yc1 a1 + yc2 a2 ) / am (3.16)
A given histogram bin, b, is calculated using equation 3.17.

bm = (b1 a1 + b2 a2 ) / am (3.17)
The MBR of the merged silhouette is simply recalculated as the smallest
rectangle that encloses both silhouettes. The width and height of this are
then extracted and placed into the new feature vector.
This is equivalent – barring any small rounding errors – to assuming
that these two silhouettes had simply initially been segmented as a single
foreground region. The operations described above are performed – rather
than merely adding all the pixels up and performing a full feature extraction
again – because it is significantly more computationally efficient.
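The area-weighted merge of equations 3.14–3.17 can be sketched as follows. The dictionary field names and the MBR representation (x0, y0, x1, y1) are assumptions of this sketch, not the thesis data structures.

```python
def merge_features(f1, f2):
    """Area-weighted merge of two silhouette feature dicts
    (equations 3.14-3.17), plus recomputation of the enclosing MBR."""
    a1, a2 = f1["area"], f2["area"]
    am = a1 + a2                                      # eq. 3.14
    xm = (f1["xc"] * a1 + f2["xc"] * a2) / am         # eq. 3.15
    ym = (f1["yc"] * a1 + f2["yc"] * a2) / am         # eq. 3.16
    hist = [(b1 * a1 + b2 * a2) / am                  # eq. 3.17, per bin
            for b1, b2 in zip(f1["hist"], f2["hist"])]
    # MBR: smallest rectangle enclosing both silhouettes
    x0 = min(f1["mbr"][0], f2["mbr"][0])
    y0 = min(f1["mbr"][1], f2["mbr"][1])
    x1 = max(f1["mbr"][2], f2["mbr"][2])
    y1 = max(f1["mbr"][3], f2["mbr"][3])
    return {"xc": xm, "yc": ym, "area": am, "hist": hist,
            "mbr": (x0, y0, x1, y1), "w": x1 - x0, "h": y1 - y0}
```

Because the merged features are again a feature dict of the same shape, the result of one merge can itself be merged with further silhouettes, as the text describes.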
Though the example above describes only the merging of two silhouettes,
the results of one merger can then be merged with other silhouettes indef-
initely to produce macro silhouettes that are the results of any number of
other silhouettes. It should also be noted that a silhouette to be merged can
be the result of an earlier partitioning.
The converse situation, where a single silhouette needs to be partitioned
into several silhouettes, is far less trivial. The following partitioning tech-
niques assume that only minimal overlap occurs between objects. In other
words, heavily occluded objects will not be partitioned correctly. Thankfully,
on scenes of relatively low activity, heavy occlusions are relatively rare. One
of the reasons for this is the high-level feedback that incorporates parked cars
into the background. If this were not the case, pedestrians would exit the car
and immediately cause heavy occlusion between the two objects. The most
common situation where occlusion occurs is when two pedestrians walk close
together side by side, where overlaps tend to be minimal. The larger the
overlap, the less precise the partitioning, but only very large occlusions cause
the partitioning algorithm to fail completely. This will be re-examined later.

As with many of the modules in this system, several different techniques
were developed, tested and compared. All attempt to solve the same problem,
which can be characterised as follows. Given a silhouette, find the line that
best divides the pixels into the parts which belong to each object. Using
a line is in itself only an approximation, which nonetheless has proved to
be surprisingly accurate. An initial technique is now briefly presented as a
possible approach and serves as an introduction to the techniques used in
the final design.
As two objects merge together to produce a single silhouette, the result-
ing silhouette will tend to be elongated along the axis which passes through
the centroids of each object. This basic assumption can be used as a starting
point to produce a rough approximation of the partition. A Principal Com-
ponents Analysis (PCA) is performed on the group of pixels that make up
the silhouette to produce a best fit line crossing the centroid of the silhou-
ette, and running along the axis that minimises squared ‘error’. A second
line, placed orthogonally to the first line, splits the silhouette into two sec-
tions. The position of the second line along the first is such that it splits the
silhouette into areas which are proportional to the areas of the objects they
represent.
In image (a) of figure 3.12, the PCA line is calculated for the silhouette
and an orthogonal partition line – which will eventually split the silhouettes
– is calculated. The partition line is moved to a position such that it divides
the silhouette into areas which are proportional to the object areas they
represent. Image (b) shows the two resulting silhouettes, and image (c)
displays the MBRs of the silhouettes against the input video image. Though
not perfect, the result of the partition – in this case – is relatively close to
the true partition.

Figure 3.12: The PCA object partitioning algorithm

Whilst the conceptual approach outlined above is easy to visualise,
the algorithm which implements it requires a more formal approach. There
may be an arbitrary number of objects into which the silhouette must be
split. To calculate the area, ai , of the silhouette allocated to each object i,
equation 3.18 is used.

ai = (qi / Σk=1..m qk ) s (3.18)

where qi is the area of one of the m objects involved, and s is the area of
the silhouette being partitioned. A central concept in the following sequence
of steps is the projection of a given pixel onto the PCA line. A projection,
p, on the PCA line, L, given a pixel vector, z, is calculated using the dot
product of the two vectors, as shown in equation 3.19.

p=z·L (3.19)

This is shown pictorially in figure 3.13.

The sequence of steps taken by the algorithm to partition the silhouette
are best summarised into the following stages:

Figure 3.13: Projecting a pixel onto the PCA line using the dot product

• Project all pixels onto PCA line: The position of each silhouette
pixel is projected onto the PCA line, placed into a list and sorted by
projected position into an ascending order.

• Project object centroids onto PCA line: The position of each

object (centroid) is projected onto the PCA line, and as before placed
into a list and sorted by projected position into an ascending order.

• Allocate pixels to objects: The first afirst pixels in the list are
allocated to the first object in the list (where first is the index of the
first object in the list). The next asecond pixels are allocated to the next
object and so on.
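The three stages above might be sketched as follows, assuming the PCA axis has already been found. Rounding of the pixel shares is an implementation detail not specified in the text; here the last object simply absorbs the remainder.

```python
def partition_pixels(pixels, axis, object_areas):
    """Allocate silhouette pixels to objects along the PCA line
    (equations 3.18 and 3.19). `pixels` is a list of (x, y) positions,
    `axis` a unit vector along the line, and `object_areas` the object
    areas q1..qm, already sorted by centroid projection onto the axis."""
    s = len(pixels)                     # silhouette area (pixel count)
    q_sum = sum(object_areas)
    shares = [q * s / q_sum for q in object_areas]   # eq. 3.18
    # eq. 3.19: sort pixels by their dot-product projection onto the axis
    ordered = sorted(pixels, key=lambda p: p[0] * axis[0] + p[1] * axis[1])
    groups, start = [], 0
    for i, share in enumerate(shares):
        end = s if i == len(shares) - 1 else start + round(share)
        groups.append(ordered[start:end])
        start = end
    return groups
```

On a horizontal axis, for instance, the leftmost pixels end up in the group of the leftmost object, exactly as the ordering argument in the text requires.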

By ordering all the pixels and all of the objects by their position on the
PCA line, we ensure that each pixel is allocated to the correct object. On
a horizontal PCA line, for example, the leftmost pixels will be allocated
to the leftmost object in the partitioning. Visual inspection of many test
cases reveals that this type of ordering is extremely reliable. Whilst the
algorithm which performs the partitioning once the PCA line has been found
is extremely reliable, the weak link in this algorithm is the initial positioning
of the PCA line. In formulating the PCA approach, an implicit assumption
has been made that two merged objects will produce sufficient silhouette
elongation to give an accurate PCA line. However, this assumption is only

truly accurate if the objects involved have a roughly 1 to 1 height/width
ratio. For example, if two pedestrians stand side by side and merge, their
main PCA line may be vertical due to the merged silhouette still being taller
than it is wide. Unfortunately, this occurs relatively frequently. An example
is shown in figure 3.14.

Figure 3.14: The PCA-based partitioning of two merged pedestrians. The

merged silhouette is taller than it is wide, causing partitioning to fail.

This can be corrected – if one assumes that one is always dealing with
pedestrians – by scaling down the Y axis of the image by the average
height/width ratio of a pedestrian. Dividing the Y component of each pixel
by this ratio preserves the assumption of a 1:1 aspect ratio. Of course,
rather than do this literally
the Y component in the covariance matrix resulting from the PCA is simply
scaled. The average pedestrian aspect ratio in this scene is estimated – using
reference standard information – to be around 2.5 : 1. The true aspect ratio
of a human being is perhaps closer to 6 : 1, and this lower ratio of 2.5 : 1
is the result of foreshortening due to the camera’s elevated position. Results
from the same scene as above using the new ‘scaled’ technique are shown in
figure 3.15.
The scene in figure 3.15 is typical of all the pedestrian partitionings at-
tempted on real data. This scaling technique also works well at unusual
angles. To test this thoroughly, the silhouettes of two pedestrians have also

Figure 3.15: Results from a scaled PCA partition on real data

been merged artificially at different angles. A sample of these tests is shown

in figure 3.16.

Figure 3.16: Results from a scaled PCA partition on artificially merged
silhouettes

Whilst results on pedestrians using this technique are excellent, this tech-
nique should be sufficiently robust to be able to partition silhouettes resulting
from any combination of cars and pedestrians. Unfortunately, due to cars
having a different average aspect ratio, the technique sometimes performs
very poorly on these relatively rare occasions.
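The Y-scaling idea can be sketched as follows. Note that the text scales the Y component of the covariance matrix directly, whereas this sketch scales the pixel coordinates before computing the covariance, which yields the same principal axis under the stated assumption; the closed-form angle for a 2×2 covariance matrix replaces a full eigendecomposition.

```python
import math

def scaled_pca_axis(pixels, aspect_ratio=2.5):
    """Principal axis of a silhouette with the Y coordinates scaled down
    by the average pedestrian height/width ratio (estimated at 2.5:1 for
    this scene), preserving the 1:1 aspect assumption of the PCA split."""
    n = len(pixels)
    xs = [x for x, _ in pixels]
    ys = [y / aspect_ratio for _, y in pixels]   # scale Y only
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # angle of the leading eigenvector of [[sxx, sxy], [sxy, syy]]
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))
```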
Instead of visualising the problem as a step by step solution to find the
best line (such as PCA), it can once again be broken down into a searchable
space in which a cost must be minimised. The search space is a number of
candidate angles at which to fit the line. The cost can be calculated using the
cost of matching the resultant silhouettes with the partition objects, using

the SOM cost function described in section 3.3.2. This technique, instead
of the PCA algorithm, is used in the final design of the algorithm. In the
following example, angles are measured in radians relative to the horizontal,
with angles increasing as the line is ‘spun’ clockwise. The algorithm tries β
different angles ranging from −π/2 radians (vertical) up to (but not including)
+π/2 radians (vertical once again). All lines intersect the centroid of the
silhouette. In the final design, β is set to 30. This can be set to a lower number to
ease the processing requirements. For each candidate angle, the silhouettes
are split as described earlier for the PCA algorithm. The object to silhou-
ette costs are then calculated and added together. The angle producing the
lowest cost is assumed to be the best match.

Figure 3.17: Angle-search method – (a) 30 different angles are assessed, and
(b) the best angle is chosen

Image (a) in figure 3.17 shows the 30 different partition angles used to
split the silhouette, and figure (b) displays the results of the lowest cost angle.
Figure 3.18 plots the total cost of each angle. The centre of the graph shows
a clear dip in the cost where the best result lies (around the 0 radian mark
– horizontal). Due to the reliance on the object-silhouette cost function, the
testing on artificial images would require entire sequences of images to be
fabricated – rather than a single frame. Therefore, testing is limited to real
data only.
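The angle search might be implemented along these lines; `split_at` and `match_cost` are hypothetical stand-ins for the partitioning routine and the SOM cost function described in section 3.3.2, and their signatures are assumptions of this sketch.

```python
import math

def best_partition_angle(silhouette, objects, split_at, match_cost, beta=30):
    """Search beta candidate angles in [-pi/2, +pi/2) for the partition
    line yielding the lowest summed object-silhouette match cost."""
    best_angle, best_cost = None, float("inf")
    for i in range(beta):
        # evenly spaced angles; +pi/2 itself is excluded
        angle = -math.pi / 2 + i * math.pi / beta
        parts = split_at(silhouette, angle, objects)
        cost = sum(match_cost(obj, part) for obj, part in zip(objects, parts))
        if cost < best_cost:
            best_angle, best_cost = angle, cost
    return best_angle, best_cost
```

Lowering β trades partition precision for processing time, exactly as noted in the text.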
The results from this technique are excellent, with very few examples of

Figure 3.18: Graph plotting the total cost as different angles were tested. A
very clear dip is visible in the centre of the graph where the true best result lies

complete failure of the algorithm. Generally, where the algorithm fails, one
of the four main assumptions no longer holds true. These are:

• The previous frame’s objects were tracked correctly

• There is little occlusion between the objects

• The objects can be divided along a single line

• There is only a small change in area from the previous frame

As can be seen in figure 3.19 picture (d), not all partitions are perfect but
‘good enough’ for reliable tracking to continue. In this case, the imprecision
is caused by the fact that the car is moving away from the camera and
the area it occupies is decreasing. At the same time, the pedestrians are
moving towards the camera and increasing in area. However, the partition

Figure 3.19: A sample of four partitions tested to assess the quality of the
search algorithm. (a) Partitioning 2 cars, (b)(c) 2 pedestrians, (d) 2 pedes-
trians and a car

algorithm has assumed the area has remained stable and has therefore over-
allocated area to the car. This effect increases if partitioning occurs over
many frames without the silhouettes appearing separately to ‘reset’ the areas
of the objects. Thankfully situations in which this is a problem – where
objects are moving in opposite directions – tend to lead to a very small
number of contiguous partitioning frames. A similar problem with incorrect
areas occurs if an object coming onto the scene merges with an object already
on the scene. The emerging object’s area will have been underestimated due
to its being partially beyond the vision of the camera, reducing the quality
of the partition. These issues could be addressed by searching for a better
measure of area in the local neighbourhood. However, as this would have to
be repeated for every angle it would be too costly and is not guaranteed to
improve the results. An increase in degrees of freedom can often lead to a
decrease in performance.
Over several contiguous frames of partitioning, the effect of the quality of

the previous match on the current match becomes clear. Over several frames,
the objects’ features might drift significantly from their true values. Thank-
fully, the nature of the SOM cost function – in particular the Appearance
SOM – tends to keep these values to that which is expected to be the norm.
This constant redressing to the norm helps the algorithm perform extremely
well even during long periods of sustained partitioning.
Although rare, there are of course situations in which partitioning fails
completely. Heavy occlusion alone is rarely the cause. Only when objects
are simply not linearly separable does partitioning fail completely. Figure
3.20 illustrates one such scenario in which a car and pedestrian overlap sig-
nificantly, causing a partitioning failure:

Figure 3.20: An illustration of two poor partitionings, taken one second apart

The two left-hand images are from the earlier frame. The two rightmost
images are taken 4 frames later. The top images show the partitioning of the
silhouette, and the bottom images are the resulting MBRs overlaid onto the original
video frame. The small amount of overlap in the earlier frame causes little
problem for the partitioning algorithm. As the frames progress however, the
overlap becomes such that the pedestrian and car can no longer be separated
along a single line, causing partitioning to fail. This kind of heavily overlapping
partitioning occurs relatively rarely. In these situations (including the one
depicted above), the partition will cause a high cost and the tracking of one

object (the pedestrian in this case) to simply be lost. When the partitioning
ends, tracking of the lost object resumes normally. Typically, this will last
for less than 3 seconds.
In the course of this work, several other non-linear techniques have been
explored in an attempt to provide a yet more robust solution to the problem.
Snakes and areas of high-gradient are unfortunately too unreliable, mostly
due to the poor quality of video and lack of significant gradient boundaries
between objects. Also attempted was splitting the silhouette into regions of
homogeneous brightness (using a watershed algorithm), in the hope that the
silhouette could be split along the boundaries of these regions. Results are
disappointing, however, as region boundaries rarely match the true boundary.
In summary, the linear-split search algorithm approach provides a good
partition in a large majority of cases, whilst heavy overlapping can cause par-
titioning to fail. However, the object tracker will often recognise this, discard
the offending object and continue tracking once partitioning has ended. Im-
plementing the search algorithm required considerable technical changes to
the system. The reason for this is that the partitioning must be performed
after any mergings to produce a sensible silhouette-object costing. If two
pedestrians have merged into a single silhouette and one pedestrian’s legs
are present in a separate silhouette, the two silhouettes must first be merged
before partitioning for the cost function to perform accurately. Therefore, the
conflict resolution step must ensure that merges have priority – chronologi-
cally speaking – over partitionings. Many other minor technical challenges
arise from the fact that the partitioning step must be able to accept any
partitioning combination, regardless of whether they make sense. Resulting
silhouette areas may be less than a single pixel, for example. Therefore,
some partitions produce negative results which are ‘propagated up the algo-

rithm’ and inform the search algorithm that a given match matrix is invalid.
The cost-driven approach to partitioning seems natural in the context of a
cost-driven matching algorithm. One advantage of this technique is that an
improvement in the cost function results in both better object matching and
better partitioning.

3.3.4 Conflict Resolution

This subsection describes the conflict resolution stage, whose aim it is to take
a candidate match matrix and perform the relevant merges and partitions
such that the end result is a one-to-one matching between objects and
silhouettes. One might assume this step to be trivial, given that both the merging
and partitioning procedures have been defined in section 3.3.3. However, the
order in which merges and partitions are performed affects the quality of the
outcome, and merges must generally be performed before partitions. The
algorithm must also be able to deal with any candidate match matrix, even
if it is not – in human terms – a sensible matching. The algorithm must
therefore be designed in a generalised manner so as to be able to process any
possible match matrix.
In the overview of the object tracking algorithm, the match matrix is
defined as an n × m matrix, where there are n objects and m silhouettes.
In the matrix entries, a value of 1 represents an object-silhouette match and
a value of 0 the lack thereof. Although the match matrix is the physical
structure used in the implementation of the algorithm, it is also useful to
visualise the match matrix as a graphical mapping of objects and silhouettes.
A 4 × 5 candidate match matrix, Mc , and its equivalent bipartite graph are
shown in figure 3.21.
In resolving the conflicts, the algorithm makes use of two further states in

Mc =
      S0  S1  S2  S3  S4
 Q0    1   1   0   0   0
 Q1    1   1   0   0   0
 Q2    0   0   1   0   0
 Q3    0   0   1   0   1

Figure 3.21: A candidate match matrix, Mc , illustrated as a bipartite graph

the match matrix other than ‘0’ and ‘1’. The states are summarised below:

Mc (q, s) =
    0 – No link between silhouette and object
    1 – Link between object and silhouette that needs resolving
    2 – Secure link between object and silhouette
    3 – Silhouette-object match no longer possible

where q and s are arbitrary object and silhouette indices, respectively.

The candidate match matrix is changed internally within the conflict
resolution function as the algorithm progresses. Only the ‘0’ and ‘1’ states
are present in the match matrix that is passed to this function. The ‘2’ state
is the final secure state that is obtained from the conflict resolution, and will
eventually represent a one-to-one match between an object and a silhouette
– that is that they are matched only to each other. The ‘3’ state is merely
a result of the internal functioning of the algorithm, and simply serves to

denote the fact that a silhouette-object matching is no longer possible. The
silhouettes which resulted from the object segmentation step are referred to as
the original silhouettes. When original silhouettes are merged or partitioned,
the extra silhouettes which are created are appended to the end of the list of
original silhouettes and the match matrix is suitably extended. These extra
silhouettes are known as generated silhouettes.
When a silhouette is partitioned between two objects, for example, the
two generated silhouettes are appended to the list and the match matrix
suitably extended. Links from the objects to the original silhouette are then
set to ‘3’ and the appropriate new links to the generated silhouettes are
created (‘2’).
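The bookkeeping described above might look like this; representing the match matrix as a list of integer rows indexed as M[q][s] is an assumption of this sketch.

```python
def partition_in_matrix(M, objects, silhouette_idx, generated):
    """Record a partition in the match matrix: links to the original
    silhouette are set to '3' (no longer possible), the generated
    silhouettes are appended as new columns, and secure '2' links are
    created between each object and its generated silhouette."""
    n_new = len(generated)
    for row in M:
        row.extend([0] * n_new)       # extend matrix for the new columns
    base = len(M[0]) - n_new          # column index of first new silhouette
    for k, q in enumerate(objects):
        M[q][silhouette_idx] = 3      # original link no longer possible
        M[q][base + k] = 2            # secure one-to-one link
    return M
```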
The conflict resolution can be split into 3 steps, the first two of which
are repeated until there are no longer any unresolved links (no ‘1’ states in
the match matrix), followed by a final consolidation step, as seen in figure 3.22.

Figure 3.22: An overview of the steps taken by the conflict resolution module

Step 1: Resolve fully connected subgraphs: This step seeks to find

all complete bipartite subgraphs within the matrix, and resolve them to one-
to-one matchings. A good example of such a subgraph is visible at the top
of figure 3.21, where objects Q0 and Q1 are fully connected to silhouettes

S0 and S1 . No object or silhouette within the subgraph is connected to any
other outside the group. Stated formally, given the two sets of objects and
silhouettes, Q and S respectively, the algorithm seeks to find an object subset
R and a silhouette subset T such that the conditions listed in equations 3.20,
3.21 and 3.22 all hold.

∀q ∈ R ⋆ ∀s ∈ T ⋆ Mc (q, s) = 1 (3.20)

∀q ∈ R ⋆ ∀s ∈ S ⋆ (Mc (q, s) = 1) =⇒ s ∈ T (3.21)

∀q ∈ Q ⋆ ∀s ∈ T ⋆ (Mc (q, s) = 1) =⇒ q ∈ R (3.22)

In order to make such a search computationally efficient, the objects are

sorted into bins according to their valency, where valency (also known as
degree) is the number of silhouettes to which an object is connected. The
subsets within each bin are then searched, reducing overheads. With a fully
connected subgraph found, the first step is to merge the silhouettes in the sub-
set if there is more than one – this temporary silhouette is stored separately
to the list of silhouettes. Links to the subset silhouette(s) are broken (set to
‘3’). If there is more than one object, the temporary silhouette is partitioned
as described in section 3.3.3 and the generated silhouettes are added to the
end of the silhouette list. New one-to-one links (‘2’) are created between
the objects and the, possibly newly generated (if merging/partitioning was
necessary), silhouettes. Conceptually, the reasoning behind this approach is
relatively simple. If there is a small group of objects travelling in close prox-
imity made up of several silhouettes, and the search function has found them
to all share the same silhouettes, we simply merge all silhouettes and then
partition them. At the end of step 1, there are no longer any fully connected subgraphs left unresolved in the match matrix.

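The subgraph conditions of equations 3.20, 3.21 and 3.22 can be checked mechanically. The sketch below takes a simpler route than the valency-bin search used in the thesis: it extracts the connected components of the ‘1’ links and keeps only those that are complete bipartite, i.e. where every object-silhouette pair inside the component is linked. The function name and matrix representation are illustrative only.

```python
# Find complete bipartite subgraphs of the '1' links in match matrix M
# (rows = objects, columns = silhouettes). A connected component satisfies
# equations 3.20-3.22 exactly when every object in it links to every
# silhouette in it.
def complete_subgraphs(M):
    n_q, n_s = len(M), len(M[0])
    seen_q, components = set(), []
    for start in range(n_q):
        if start in seen_q or not any(M[start]):
            continue
        R, T, stack = set(), set(), [("q", start)]
        while stack:                       # flood-fill one component
            kind, idx = stack.pop()
            if kind == "q" and idx not in R:
                R.add(idx)
                seen_q.add(idx)
                stack += [("s", s) for s in range(n_s) if M[idx][s] == 1]
            elif kind == "s" and idx not in T:
                T.add(idx)
                stack += [("q", q) for q in range(n_q) if M[q][idx] == 1]
        # Keep the component only if it is fully connected.
        if all(M[q][s] == 1 for q in R for s in T):
            components.append((sorted(R), sorted(T)))
    return components
```

On a matrix resembling the top of figure 3.21 (objects Q0, Q1 fully connected to silhouettes S0, S1) the component is returned; a component in which one link is missing is rejected, and would instead fall through to step 2.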
Step 2: Resolve non-fully connected subgraphs: Step 2 seeks to
further reduce the number of insecurely matched silhouettes in the hope that
this may yield more fully connected subgraphs, which can then be matched
by step 1. Step 2 begins by finding the lowest valency silhouette. If the
silhouette valency is 1, the object-silhouette match is made secure (‘2’) –
this is the most common outcome. Otherwise, the silhouette is partitioned
between the objects it matches, and the matches are updated to reflect this.
For each new object-silhouette secure match, the area of the allocated silhou-
ette is subtracted from the object area. If this step were not taken, further
partitionings in step 1 would be marred by incorrect area allocations. These
adjustments to object areas are rolled back at the end of the conflict resolu-
tion module. Step 2 is somewhat less ‘mathematically pleasing’ than step 1,
in that the result it produces may not be optimal. This step is only neces-
sary as a result of the fact that the conflict resolution stage must be able to
deal with all possible combinations of the match matrix. An example of the
necessity of step 2 can be seen in the lower part of figure 3.21. A typical sit-
uation where this might occur is when two pedestrians walking side by side
merge across their chests, but the legs of one pedestrian are disconnected
from the merged silhouette. Once step 2 is complete, the match matrix is
checked for unresolved matches (‘1’). If any remain, the algorithm returns
to step 1.
Step 3: Consolidate secure matches: If no unresolved matches re-
main, the match matrix is checked for objects which are matched to several
silhouettes with a ‘2’ matching. In such a case, the silhouettes are merged
into a new single ‘generated’ silhouette which is added to the list of silhou-
ettes and the match matrix is extended. Links to the original silhouettes
are then broken, and a single new link to the newly generated silhouette is

created. This matching of several silhouettes to one object can arise since
step 2 may have securely matched an object to a silhouette, and step 1 later
matched the same object to another silhouette. This finalising step serves to
ensure that all matches are one-to-one, ready for the cost function.
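The control flow tying the three steps together can be summarised as a loop. This is a structural sketch only; the three step functions are placeholders standing in for the operations described above.

```python
# Structural sketch of the conflict resolution control flow. The three
# callables are hypothetical stand-ins for steps 1-3 described above.
def resolve_conflicts(M, resolve_complete_subgraphs,
                      resolve_lowest_valency, consolidate):
    has_unresolved = lambda: any(1 in row for row in M)
    while has_unresolved():
        resolve_complete_subgraphs(M)    # step 1
        if has_unresolved():
            resolve_lowest_valency(M)    # step 2
    consolidate(M)                       # step 3: enforce one-to-one matches
    return M
```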
An alternative method to step 2 was briefly considered, whereby to re-
solve situations in which fully complete subgraphs do not exist the required
connections to make a subgraph complete are simply added. In figure 3.21,
this would add a connection between Q2 and S4 . On reflection however, it
becomes apparent that this is equivalent to reducing the size of the search-
able space of possible match matrices, and simply making sure that such
incomplete subgraphs are never sent to the conflict resolution step in the
first place would be preferable. Such a change would naturally be made to
the search function, not the conflict resolution module.

3.3.5 Search Function

The role of the search function module is to search the space of all possible
match matrices in order to find the lowest cost matching between objects
and silhouettes. The eventual product of this stage is the lowest cost match
matrix found, and the resulting silhouette-object matches generated by the
conflict resolution stage. To recap, the first step is to generate the valid
match matrix V , which defines which object to silhouette matches are deemed
potential matches. For each possible match in the valid match matrix, the
corresponding match matrix value can be set either to a match (‘1’), or no
match (‘0’). The size of the search space is therefore dictated by the number
of potential matches in V . If m is the number of potential matches in V , the
number of potential match matrices, s, is shown in equation 3.23.

s = 2^m                                                   (3.23)

In its simplest form, the search function performs an exhaustive search

through the space of all possible candidate match matrices. This approach
will be discussed first, followed by the greedy algorithm which is used in the
final design.
To assess the global cost of a candidate match matrix, the exhaustive al-
gorithm first sends the match matrix to the conflict resolution module. This
resolves the matchings to one-to-one matches only. This one-to-one matching
is then sent to the global cost function. Only the best cost value and the best
candidate match matrix need to be kept in memory, and these are replaced
when a match is found that has a lower cost. The exhaustive approach was
the first to be developed because of its simplicity and complete approach
to the problem. This complete approach also proved to be extremely use-
ful in the development of an accurate cost function. The statistical analysis
described in Chapter 4 allows for a quantitative comparison of the perfor-
mance of different tracking pipelines. Using the exhaustive search function,
it is possible to directly compare the performance of different cost functions.
Since it is certain that all candidate match matrices have been tested, if a
given cost function outperforms another then the cost function alone must be
responsible for the superior performance. If an incomplete greedy algorithm
had been used in the testing, it would have raised the possibility that the
greedy nature of the algorithm was limiting the search in a way that was
advantageous to one cost function and not the other. With the relatively
high certainty of having reached a stable, accurate cost function, the greedy
algorithm could then be developed. Another use of the exhaustive function
is to aid in the development and testing of the greedy algorithm. Knowing

the result of the search of the entire subset, the aim of the greedy algorithm
is to reach as close to this result as possible using relatively little processing
time. A statistical analysis of performance and comparison of cost functions
is provided in Chapter 4.
With a complexity of O(2^m), the performance analyses using the exhaus-
tive function have to be performed offline. With run-times approaching a
week for some 1.5-hour video sequences, several methods can be used to
speed up the process. In general, the search space appears to be inseparable
– that is that the problem cannot be dissected into smaller, and therefore
more tractable, components. If an individual object allocation is chosen such
as to minimise cost, it cannot be assumed this is the best allocation globally
since this allocation may affect the costs of other objects and may itself be
affected by other objects. This is due to the effect of partitioning. It cannot
be assumed that an object will be allocated the whole silhouette that it is
linked to in the match matrix. This interdependency of matches between
objects points to an NP-complete problem, in which the optimal solution
can be found only by exhaustively searching the entire space of possibilities.
Whilst this is true in the general case, some cases do have the property
of separability. One such case appeared in section 3.3.3 and is now repeated in figure 3.23.
In this situation, ‘decisions’ made in the upper part of the graph cannot
influence the costs associated with the lower part of the graph. This means
that the space related to the upper part can be searched first, and the lower
part afterwards. The computational effort, when treated as a single searchable
entity, is 2^7 = 128 combinations. When split, this drops to 2^4 + 2^3 = 24. To
give a more tangible aperçu of the situation, it occurs when several groups of
objects are sufficiently far away from one another onscreen so as to be totally

Figure 3.23: A valid match matrix, V , illustrated as a bipartite graph. The
search space can be divided into 2 along the dotted line.

separate silhouette allocation problems. This concept was briefly tested in

software. The valid match matrix is initially processed to make the problem
separable (if possible), by swapping rows and columns (and the correspond-
ing objects and silhouettes) such that objects and silhouettes in the same
subset are adjacent. The valid match matrix can then be decomposed and
each section sent to the exhaustive search algorithm. This presents consid-
erable technical challenges, and can require a considerably different software
architecture than simpler approaches. Unfortunately, this method provides
little relief from the longest and most complex scene calculations. This is
because some of the least tractable frames tend to occur when perhaps three
objects appear in close proximity to one another and silhouette fragmen-
tation and noise cause a combinatorial explosion. Nevertheless, in scenes of
high activity this technique could prove to be extremely useful and could also
reduce overheads with search algorithms other than the exhaustive approach.
Due to an already complex code base, a relatively small improvement in per-
formance and considerable technical difficulties in integrating the technique
into the algorithm in general, the technique is not used in the final design.
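When the valid match matrix does decompose into disconnected components, the saving can be computed directly. The sketch below (illustrative names, plain Python, not the thesis code) finds the connected components of V and counts the ‘1’ entries in each; the search cost then falls from 2^m to the sum of 2^(m_i) over the components.

```python
# Decompose the valid match matrix V (rows = objects, columns =
# silhouettes) into connected components and return the number of
# potential matches ('1' entries) in each component. Each component is an
# independent allocation subproblem of size 2^(m_i).
def component_edge_counts(V):
    n_q, n_s = len(V), len(V[0])
    seen, counts = set(), []
    for start in range(n_q):
        if start in seen:
            continue
        stack, comp_q, comp_s = [start], set(), set()
        while stack:                       # flood-fill via shared silhouettes
            q = stack.pop()
            if q in comp_q:
                continue
            comp_q.add(q)
            seen.add(q)
            for s in range(n_s):
                if V[q][s]:
                    comp_s.add(s)
                    stack += [q2 for q2 in range(n_q) if V[q2][s]]
        counts.append(sum(V[q][s] for q in comp_q for s in comp_s))
    return counts
```

With two components of 4 and 3 potential matches, as in figure 3.23, the searchable space drops from 2^7 = 128 to 2^4 + 2^3 = 24.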
Several other techniques are used to improve performance of the exhaus-

tive search algorithm. These generally concentrate on trimming the search
space without compromising the certainty that the final result is optimal.
One potential method is to measure the maximum object silhouette match
distance in the reference standard. Using this distance, one can reduce the
search radius r used to generate the valid match matrix and therefore reduce
the number of possible matches in V without compromising the certainty
that the result is optimal. Unfortunately, by cutting out part of the search
space, it is also being falsely reduced. If there exists a false matching with a
lower cost that is beyond r, one could incorrectly assume that the algorithm
is performing admirably.
The following technique is a form of the common branch and bound al-
gorithm, which aims to trim the search space with a minimum bound on the
cost of object-silhouette matchings. It is possible to view the search space as
a tree search, starting at the first object’s allocation and descending down
the tree by allocating the next object and so on. Obviously, as stated earlier
it is impossible to reach a conclusive cost for the allocation of a given object
because of the inseparability of the search space. It is possible, however, to
place a lower bound on the cost an object might incur. This lower bound
is calculated in the case of the cost function based on a measure of vector
differences. Though it may be possible to place a bound on the cost in the
case of the SOM cost function, this was not attempted due its relative com-
plexity and time constraints. The technique of finding a lower cost bound
focuses on placing upper and lower bounds on the MBR produced by silhou-
ettes. Given a single silhouette match, the resulting maximum sized MBR is
the original silhouette MBR, and minimum is size 0. This is because other
objects may cause partitionings of this silhouette. When merging two silhou-
ettes the ‘maximum’ MBR is that which encompasses both silhouettes, and

the minimum is the result of horizontal and vertical distances between the
two MBRs, if any. Similar conclusions can be drawn using area and silhouette
centroids. Using this information it is possible to place a minimum bound on
the cost of an object-silhouette matching. Before the exhaustive algorithm
begins, the greedy algorithm (described shortly) is run and its global cost
is noted. Starting at the top of the tree, the algorithm begins by allocating
the first row (object) of the match matrix. The minimum cost bound for
that object is calculated. If this cost is above the lowest cost found so far
(initially this is the greedy cost) the tree is ‘pruned’ – the algorithm does not
descend further into the tree – and the next possible object-silhouette allo-
cation is tried. Otherwise, the search function proceeds to the next object
in the list and repeats the process, adding the minimum cost of this object
to the costs of all those above it to provide a minimum bound on the global
cost. Again, if the minimum cost is higher than the lowest cost found so far,
the algorithm does not proceed further down the tree. Eventually, the search
function will reach the bottom of the tree where an entire match matrix is
allocated and will calculate the global cost of the matching. If this cost is
lower than the minimum cost, it replaces the minimum cost and becomes
the best match found so far. At the bottom of the tree, all combinations of
possible matches are tried for the final object. When these are exhausted,
the algorithm returns to the object immediately above it, changes its allo-
cation to one that has not been tested yet and once again tries all possible
combinations of the final object. This process is recursive and ensures that
all possible combinations of the match matrix are tested, excepting of course
those parts of the search tree which are pruned. Despite the very rough cost
estimates, this technique is extremely effective at cutting down the search
space. The percentage of the full space actually searched is displayed on the GUI to provide

feedback on how effective the technique was. The results vary widely, but
it is most effective on extremely busy scenes – those which suffer most from
combinatorial explosion – and regularly cuts the search space by a factor
of 10 and even occasionally up to 100. This technique has helped to con-
siderably reduce the time taken to perform statistical analyses when using
the simpler cost functions. Unfortunately, a similar lower bound for the cost
function is impossible to find with certainty when using the SOM cost func-
tions. Therefore this technique was used only in the context of the feature
vector difference based (Owens) cost function.
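The pruning mechanics can be illustrated with a deliberately simplified branch and bound, in which each object's candidate costs are assumed independent so that the running partial sum is itself a valid lower bound. In the thesis the bound instead derives from MBR, area and centroid limits, so this fragment shows only the tree-pruning behaviour, not the real cost model.

```python
# Schematic branch and bound over per-object allocations. options[j] is a
# list of candidate costs for object j (a stand-in for object-silhouette
# cost bounds). Because the bound never overestimates the true cost,
# pruning cannot discard the optimum. initial_best is typically the global
# cost returned by the greedy algorithm.
def branch_and_bound(options, initial_best):
    best = {"cost": initial_best, "alloc": None}

    def descend(j, partial_cost, alloc):
        if partial_cost >= best["cost"]:
            return                           # prune this subtree
        if j == len(options):                # full allocation reached
            best["cost"], best["alloc"] = partial_cost, list(alloc)
            return
        for choice, cost in enumerate(options[j]):
            descend(j + 1, partial_cost + cost, alloc + [choice])

    descend(0, 0.0, [])
    return best["cost"], best["alloc"]
```

The recursion mirrors the tree search described above: allocate the first object, bound the cost, and descend only while the bound stays below the best complete matching found so far.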
As briefly discussed in the previous section, another potential way in
which the search space can be reduced is to consider only those matchings
consisting of fully connected subsets. This would reduce the search space
considerably but would require the development of an algorithm that is quite
radically different in structure to that presented thus far. Also, this algorithm
could not truly be considered to be exhaustive and at the same time would
not fit the requirements of a greedy algorithm.
In order for the algorithm to execute in real time, a greedy function was
developed whose aim is to provide similar performance (to that of the ex-
haustive search) with a complexity that is sub-exponential and can therefore
operate at the target minimum rate of 4 Hertz on even the most complex
scenes encountered. The algorithm that follows is closely related in some
concepts to the Owens algorithm. The major differences are the change in
cost function, partitioning technique and a shift away from assessing cost
object-by-object to a more global cost function later in the algorithm. This
algorithm also does not deal with the concept of transient objects – that is,
objects which have only been present for a short amount of time. The Owens
algorithm is strongly preferential to non-transient objects, where clashes oc-

curred. The theory behind this is that noise can cause objects to appear
over short periods of time, therefore short-lived objects are less likely to be
‘real’ objects and are treated as ‘second-class’ objects. However, a qualitative
analysis of the performance of the following algorithm adjusted with a similar
bias revealed little or no difference in performance. For the sake of simplicity
and clarity, the steps responsible for non-transient bias were removed.
The greedy algorithm can be dissected into 4 distinct steps as follows:

• STEP 1: Naive match Each object in turn is matched to the silhou-

ette which yields the lowest object to silhouette cost. This search is
limited only to those object silhouette matches which are valid – where
Vji = 1. This is performed object by object, regardless of whether a
given silhouette has been matched to a previous object. Formally, all
entries in the match matrix are initialised to zero.

Then for each object, Qj ,

Mji = 1, where i = arg min_k c(Qj , Sk ) subject to Vjk = 1        (3.24)


At the end of this brief step, each object is matched to a single sil-
houette. However, a single silhouette may be matched to several ob-
jects. An object will be unmatched only if it has no potential silhouette
matches. The development of a reference standard and statistical anal-
ysis technique, described in chapter 4, allowed the confirmation that
this is indeed a sensible first step. In 98.7% of object-frame matches,
the reference standard match matrix confirmed that this silhouette was
at least part of the match. This figure also reflects the quality of the
neural cost function. The figure of 98.7% was obtained using the neu-

ral cost function; a figure of 96.6% was obtained on the same sequence
using the feature vector difference cost function².

• STEP 2: Identify potential merges By the term ‘merging’, the au-

thor refers to the close proximity of two objects causing them to merge
into a single silhouette. Due to the naive object-by-object approach,
several objects may be matched to the same silhouette. This step aims
to detect when a match conflict is a true merging event, and when
it is simply due to imprecisions in the costing function. To do this,
it creates a Macro-object consisting of the merging of all the objects
involved using the same techniques as described in section 3.3.3. The
only minor difference is that the object positions and MBRs are pro-
jected forward one frame prior to merging, using their current velocity
as an estimate. The new macro object, Θ, has a single feature vector.
This allows us to obtain a cost of matching Θ to the potentially merged
silhouette, Sm , using a technique that is a slight adaptation of the nor-
mal costing function. To calculate the costs, the object, Qb , which has
the lowest cost match to the silhouette Sm has to be found within the
list of relevant objects.

Qb , where b = arg min_k c(Qk , Sm )                               (3.25)


The cost of the macro-object to silhouette match has the following

differences. Instead of the features of area, height and width differences
being scaled by the macro-object’s own features, they are scaled by
the best matching object’s features. For example, the area difference is
calculated as follows:

pArea = (amacro − aS ) / abestobj                                  (3.26)

where amacro , aS and abestobj are the areas of the macro-object, silhou-
ette, and best matching object respectively.

² Tests were conducted on a 1.5 hour sequence of video, comprising 7164 object-
frames in total. The term ‘object-frame’ refers to an object’s match in a given frame.

The height difference and width difference are calculated using an iden-
tical method. It is these inputs which are then applied to the normal
costing function as described in section 3.3.2. This follows the macro
object cost scheme developed by Owens[20].

With the costs calculated, the algorithm then decides whether a true
merging event has occurred using the cost function and a very simple
discriminatory function, as shown in equation 3.27.

merge = true   if c(Θ, Sm ) < c(Qb , Sm )   (a true merge has occurred)
        false  otherwise                    (a true merge has not occurred)
                                                                   (3.27)

Put simply, the macro-object and the best-matching individual object

are compared. If the macro object is a better match, it is assumed a
merge has occurred.

If a merge is deemed to have occurred, the match matrix is left un-

changed. If a merge has not occurred, only the best matching object
retains its match to the silhouette. Each other object then searches its
list of possible matches to find the next lowest cost silhouette. This
operation is conducted naively, and can lead to further object match-
ing conflicts. If this is the case, this step is repeated until no further
conflicts remain.

• STEP 3: Search unmatched silhouettes At this stage it is still the
case that a single object will be matched to no more than one silhouette,
despite the fact that objects frequently consist of several silhouettes.
This step seeks to address this by matching each unmatched silhouette
to the object to which it is most suited, if this helps reduce the global
cost. The step cycles through each unmatched silhouette. For each one,
it is matched to each object in turn and the global cost is calculated (via
the conflict resolution and global cost function). A given unmatched
silhouette is matched to the object which incurs the lowest global cost
possible, but only if this produces a lower global cost than leaving it unmatched.

• STEP 4: Remove poor matches All the operations that have been
conducted thus far have been comparative, rather than absolute. Put
differently, costs have been compared but no absolute threshold has
been placed on the required quality of a matching in order for it to
be considered valid. Therefore, objects may be matched to silhouettes
even if the match is clearly a very poor one. For example, a car may
have exited the scene in this frame. If a small area of noise is still
present that is within the valid match radius r, the algorithm will have
matched the object to it when the best course of action would be to lose
tracking of the object. This step detects and removes poor matches.
To achieve this, each object in turn is unmatched from its silhouettes
and the global cost is recalculated. If the new cost is lower, the object
is left unmatched. Re-calculating the global cost for each object is of
course an expensive operation.

There are ways this could have been made to be more efficient. In
essence, the algorithm checks whether each object’s matching cost is

below the global cost function’s constant γ (cost of each unmatched ob-
ject). Therefore it may seem natural to only calculate the local object-
to-silhouette cost, and if it is above γ, the object could be switched to
being unmatched. Looking into the problem more deeply however, it
becomes clear that unmatching an object not only incurs the γ, but may
also affect other object matching costs due to partitionings. Therefore,
despite the higher processing requirements, the global cost approach
was taken in order to ensure that the global cost can only be decreased
by this step.
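The four steps might be condensed into the sketch below. The local cost, global cost and macro-object merge test are passed in as hypothetical callables; each link is an (object, silhouette) pair; and step 2 here tracks the silhouettes each object has already tried, to guarantee termination. None of this is the thesis code, merely an outline of its control flow.

```python
# Condensed sketch of the four-step greedy search. valid(q, s), cost(q, s),
# global_cost(links) and merge_is_real(group, s) are hypothetical stand-ins
# for the valid match matrix, the SOM cost function, the global cost
# function and the macro-object test of step 2.
def greedy_match(objects, silhouettes, valid, cost, global_cost, merge_is_real):
    links = set()                              # (object, silhouette) pairs

    # STEP 1: naive match - each object takes its cheapest valid silhouette.
    for q in objects:
        cands = [s for s in silhouettes if valid(q, s)]
        if cands:
            links.add((q, min(cands, key=lambda s: cost(q, s))))

    # STEP 2: where several objects share a silhouette, keep the conflict
    # only if the macro-object test accepts a true merge; otherwise each
    # losing object falls back to its next cheapest untried silhouette.
    tried = {q: set() for q in objects}
    changed = True
    while changed:
        changed = False
        for s in silhouettes:
            group = [q for (q, t) in links if t == s]
            if len(group) > 1 and not merge_is_real(group, s):
                best = min(group, key=lambda q: cost(q, s))
                for q in group:
                    if q != best:
                        links.discard((q, s))
                        tried[q].add(s)
                        rest = [t for t in silhouettes
                                if valid(q, t) and t not in tried[q]]
                        if rest:
                            links.add((q, min(rest, key=lambda t: cost(q, t))))
                        changed = True

    # STEP 3: attach each unmatched silhouette to the object yielding the
    # lowest global cost, but only if that beats leaving it unmatched.
    for s in silhouettes:
        if s in {t for (_, t) in links}:
            continue
        trials = {q: global_cost(links | {(q, s)}) for q in objects}
        if trials and min(trials.values()) < global_cost(links):
            links = links | {(min(trials, key=trials.get), s)}

    # STEP 4: unmatch any object whose removal lowers the global cost.
    for q in objects:
        without = {(p, t) for (p, t) in links if p != q}
        if global_cost(without) < global_cost(links):
            links = without
    return links
```

With a global cost that charges a constant per unmatched object (the γ of the global cost function), step 4 reproduces the behaviour described above: a match is dropped only when losing the object is genuinely cheaper.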

The output from this algorithm – the best match matrix found – is sent
to the conflict resolution stage to merge and partition silhouettes such that
the final result is a one-to-one matching between objects and silhouettes.
Objects which have no match are removed from the list of tracked objects.
Objects which have been found a match are updated using the features of
the silhouette to which they have been matched. Each object has an associ-
ated history of the centroids since its creation, referred to as the path. The
silhouette centroid is appended to the path. All the other features of the
object are directly inherited from the silhouette, ready for the next frame’s
matching. The possibility of using a Kalman filter to smooth both the path
and the various elements of the feature vector was considered, and this issue
is dealt with in section 5.2 (Further Work).
Finally, the resulting MBRs of the objects are displayed onscreen together
with the silhouettes which formed them. The path is also plotted onscreen.
Objects identified as being pedestrian are enclosed by a double-thickness
MBR – as opposed to vehicles, which are enclosed by a single-thickness MBR
– to denote their classification.
This simple, four step algorithm proves to be remarkably close to the

exhaustive algorithm in terms of accuracy, whilst using only a tiny fraction
of the processing overheads. The overall complexity of the algorithm is of
the order of O(n × m), with the area of silhouettes also playing a large role
in processing requirements due to the partitioning algorithm. An assessment
and discussion of the performance of this function is discussed in Chapter 4.

3.4 Object Classification

For the information gleaned about objects tracked across the screen to be
truly useful, the type of object must be known. In this case, objects are
classified as either pedestrians or vehicles. Only these two classes of object
are visible in the video footage of car parks that was available. Other objects,
such as dogs or cyclists, are not present in any of the footage. Should the
need arise, the techniques described here could be adapted to differentiate
between several more classes of object. The need for high accuracy of this
object classification module is heightened by the dependence of the SOM-
based cost function on object type. An incorrect object classification could
lead to poor tracking.
This classification problem can easily be summarised as follows. Given
an object’s input feature vector f = [xc , yc , a, h, w, g] – consisting of centroid,
area, height, width and histogram – it must be classified into one of two
classes: pedestrian or vehicle. The histogram is not a useful feature, and is
not used in the classification task.
The simplest statistical approach is to make use of Bayesian statistics
to place thresholds on features that show the greatest variation between
the classes – in this case area is the feature whose thresholding allows the
clearest discrimination. A threshold can be placed on area such that objects

below the threshold are classified as pedestrian, and those above as vehicles.
The reference standard data is once again a useful source of information for
supporting decisions regarding where to place the decision boundary. The
two classes are Cp and Cv – pedestrians and vehicles respectively. Given an
object area, the aim is to statistically determine to which class the object
is most likely to belong to. In Bayesian statistics this probability is known
as the posterior probability and is denoted as P (Cp |a) – the probability
that an object belongs to the class of pedestrians given area a. The areas
of reference standard vehicles and pedestrians of a 1.5 hour video sequence
were placed into bins of size 400, from areas of 0 to an area of 10000. Using
this information, a graph of the likelihoods of membership of the two classes,
given different areas, is produced. This likelihood is calculated using the
standard Bayesian formula:

P (Cp |a) = P (a|Cp ) × P (Cp ) / P (a)                            (3.28)
where P (a|Cp ) and P (a) are estimated by the following formulae:

P (a|Cp ) = (# pedestrians in bin) / (# pedestrians)               (3.29)

P (a) = (# objects in bin) / (# objects)                           (3.30)
The expression P (Cp ) is the prior probability of an object belonging to
the class of pedestrians. In this case P (Cp ) = 0.69 and P (Cv ) = 0.31, as
more pedestrians cross the scene than cars. These likelihoods are plotted
onto a graph in figure 3.24 for clarity.
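Per bin, equations 3.28 to 3.30 reduce to a simple count ratio, which the following small calculation illustrates. The counts used in the example are invented for illustration and are not the thesis data.

```python
# Posterior P(C_p | a) for each area bin, following equations 3.28-3.30.
# ped_counts[i] and veh_counts[i] are the numbers of reference-standard
# pedestrians and vehicles whose area falls in bin i.
def pedestrian_posterior(ped_counts, veh_counts):
    n_ped, n_veh = sum(ped_counts), sum(veh_counts)
    n_all = n_ped + n_veh
    prior_p = n_ped / n_all                    # P(C_p)
    post = []
    for p, v in zip(ped_counts, veh_counts):
        if p + v == 0:
            post.append(None)                  # empty bin: no estimate
            continue
        likelihood = p / n_ped                 # P(a | C_p), eq. 3.29
        evidence = (p + v) / n_all             # P(a), eq. 3.30
        post.append(likelihood * prior_p / evidence)
    return post
```

Note that the three estimates cancel to p / (p + v) per bin, so the decision boundary falls in the bin where the pedestrian and vehicle counts cross, matching the threshold placed where the plotted curves intersect.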
Given an area, the class with the highest likelihood should be chosen,
implying that the threshold should be placed where the lines cross (around
3800). However, even with the threshold placed at this optimal point there

Figure 3.24: The likelihood of an object being a pedestrian or vehicle, given
its area

                Ped    Vehicle
True Ped       4310         16
True Vehicle    301       1648

Table 3.3: Confusion matrix of the basic area-based Bayesian classifier

is significant overlap where misclassifications occur. A confusion matrix[15]

with the threshold at 3800 is provided in table 3.3.
This yields an accuracy of 95.0%. Given the impact incorrect classifica-
tion has on tracking reliability, higher accuracy is required. The single vari-
able Bayesian technique is covered here because of its great simplicity, and
therefore trivial implementation and great execution speed. Unfortunately,
to obtain a higher accuracy more inputs are needed, and therefore a more
complex model is required.
Subsequent designs are based on a more complex model in the form of
a standard Multi-Layer Perceptron (MLP)[3] neural network. As the name
implies, this neural network has several layers of perceptrons arranged in a

feedforward topology. Each perceptron performs a biased weighted sum of
its inputs and passes this through a sigmoidal activation function. Having
this differentiable activation function – as opposed to a simple threshold –
allows the use of training algorithms such as back propagation. The MLP
design includes a single hidden layer of 4 units, as shown in figure 3.25.

Figure 3.25: The final Multilayer Perceptron design, with a single hidden
layer of 4 units, for classifying an object as pedestrian or vehicle on the basis
of its basic features

As can be seen in figure 3.25, the network has 6 inputs: area, height,
width, AR, Max Speed, Centroid Y. AR refers to Aspect Ratio – as previously
mentioned in section 3.3.2 – and is calculated using equation 3.31.

AR = Height / Width                                                (3.31)
Maximum velocity is simply the highest frame-to-frame velocity the ob-
ject has achieved over its lifetime, and centroid Y is simply the Y-coordinate
of its centroid vector.
Before the network can be used it must be trained. Training seeks to alter
the weights between the units such the output error is minimised. To train
the network, 10000 training cases are submitted to the MLP, each consisting

                Ped    Vehicle
True Ped       7862         30
True Vehicle     58       2567

Table 3.4: The confusion matrix for the MLP-based classifier, with a single
hidden layer of 4 units, on a test set of 10517 cases

of the feature vector of each object and its correct classification. The MLP
was first trained with 100 epochs of back propagation to find the rough area
of the global minimum, with 500 epochs of conjugate gradient descent for
fine-tuning of the weights. 10000 cases were used in the training, and a
further 10517 cases for testing. The training and testing sets are selected as
two contiguous blocks from the total of 20517 cases, rather than randomly
selected from it. This is because from frame to frame objects’ features do not
change significantly. If the training and test cases were to feature alternate
cases in their sets, the similarity of the features could bias the test results.
An MLP with no hidden layers using a cross entropy error function (as was
used in the MLPs described here) effectively performs a logistic regression.
Training and testing as described above on an MLP with no hidden layers
yields an accuracy of 98.0%, considerably higher than that achieved using
only a threshold on area. However, the final design relies on an MLP with a
single hidden layer of 4 units that was found to again improve results.
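The forward pass of the final 6-input, 4-hidden-unit, 1-output architecture can be sketched as follows. This is an illustrative sketch only: the weights below are random placeholders (in the system they come from the back propagation and conjugate gradient training described above), and the input scaling is assumed.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_classify(features, w_hidden, b_hidden, w_out, b_out):
    # Hidden layer: 4 units, each weighting all 6 inputs.
    hidden = [sigmoid(sum(f * w for f, w in zip(features, unit)) + b)
              for unit, b in zip(w_hidden, b_hidden)]
    # Single output unit: probability that the object is a pedestrian.
    return sigmoid(sum(h * w for h, w in zip(hidden, w_out)) + b_out)

# Placeholder weights -- the real values would come from training.
random.seed(0)
w_hidden = [[random.gauss(0, 1) for _ in range(6)] for _ in range(4)]
b_hidden = [0.0] * 4
w_out = [random.gauss(0, 1) for _ in range(4)]
b_out = 0.0

# Hypothetical pre-scaled feature vector:
# area, height, width, AR, max speed, centroid Y.
features = [0.2, 0.9, 0.3, 3.0, 0.1, 0.5]
p = mlp_classify(features, w_hidden, b_hidden, w_out, b_out)
label = "pedestrian" if p > 0.5 else "vehicle"
```

With trained weights, thresholding the output at 0.5 yields the pedestrian/vehicle decision.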
The confusion matrix for the testing of the final MLP with 1 hidden layer
of 4 units is shown in table 3.4.
Note that the test set for the MLPs is larger than that used for the
Bayesian area-based classification. The accuracy of the classifier is 99.2%,
a considerable improvement on the basic Bayesian classifier's 95.0%. This is
probably close to the upper limit of performance when working with only
these limited features.
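The 99.2% figure can be checked directly from the counts in table 3.4:

```python
# Confusion matrix from table 3.4 (keys: (true class, predicted class)).
confusion = {("ped", "ped"): 7862, ("ped", "veh"): 30,
             ("veh", "ped"): 58, ("veh", "veh"): 2567}

total = sum(confusion.values())          # the 10517 test cases
correct = confusion[("ped", "ped")] + confusion[("veh", "veh")]
accuracy = correct / total
print(round(100 * accuracy, 1))          # 99.2
```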
Object classification of an unobstructed object in clear view rarely fails.
Overwhelmingly, failures tend to occur in situations such as occlusion, poor
partitioning and poor segmentation. Cars are sometimes misclassified as
they enter the scene, where they are only partially visible. As pedestrians
exit their vehicle and tracking begins, the car door is often segmented as
foreground together with the pedestrians, causing an elongated silhouette.
This is demonstrated in figure 3.26. The left image shows the input video
image with the MBR of the object superimposed; the right image is the
silhouette resulting from segmentation.

Figure 3.26: A pedestrian misclassified as a car, due to its being segmented
together with an open car door

Figure 3.27: A car misclassified as a pedestrian, due to the front end of the
car being outside the shot of the camera

Figure 3.27 shows a frame in which a car has recently parked at the edge
of the screen. Only the tail end of the vehicle is visible, leading to misclas-
sification. These misclassifications tend to last only a handful of frames at
most, and do not appear to have a detrimental effect on tracking.

3.5 Chapter Summary

This chapter introduced a robust vehicle and pedestrian tracker, capable of
running in real-time on standard PC equipment. The tracking process was
decomposed into a modular format, enabling closer analysis and indepen-
dent development of each module. Initial segmentation provides the tracker
with foreground silhouettes using a standard adaptive background differenc-
ing algorithm. A novel cost function based on Self Organising Maps pro-
vides a measure of how ‘unusual’ a given match is, which helps guide the
search function to the best possible matching. Fragmentation and merging
are specifically addressed and in general are dealt with effectively. It should
be noted that this tracker architecture is designed for relatively sparse scenes
only. Scenes containing heavy traffic, and therefore repeated heavy occlu-
sions, will cause tracking to be very poor. However, this is not the handicap
it may appear to be, since the type of suspicious and criminal behaviour that
the notional overall behaviour classification system is trying to detect will
only occur in relatively sparse scenes.
The following chapter attempts to place a quantitative measure on the
performance of the tracker and compares the performance of different cost
functions and of the search function. Chapter 5 features a conclusion and a
summary of the work, together with suggestions for further work.

Chapter 4

Results And Comparisons

This chapter discusses the creation of the reference standard and the statis-
tics and data gleaned from it. The reference standard has been referred to
throughout this document already, as it came to play a central role in the de-
velopment of the algorithm. Several statistical performance evaluations that
made use of the reference standard have been made already in sections of
this document. These were not included in this chapter since it seems more
natural to discuss these tests within the context of the description of the mod-
ule in question. This chapter will begin with an overview of the reference
standard design, together with a brief discussion of its potential deficiencies.
The following statistical analysis is broken down into an examination of each
SOM component of the cost function, followed by an examination of the cost
modules and greedy search algorithm.

4.1 Developing a reference standard

4.1.1 Overview

By ‘reference standard’, the author refers to a model of what the ‘ideal’
tracker should achieve. Broadly speaking, this should involve hand-marking
sequences of video with the position of objects (together with their type)
on the scene. There are several ways of achieving this, affecting the quality
and features of the final statistical analysis. Any statistical assessment needs
to provide simple metrics without bias. Its implementation also needs to
be practicable within a reasonable time frame. The ideal reference standard
would involve hand marking each frame with the objects present, pixel by
pixel. Clearly, this would be too time consuming. Placing the minimum
bounding box around each object is a more feasible proposition, though still
a large task on a video sequence that is long enough to be statistically sig-
nificant. Also, this would only reveal the position of the object. Information
about its true (silhouette) area and histogram would be omitted. These fea-
tures are required to train the SOMs. Another far more problematic factor is
that to assess the difference between the reference standard and the tracked
objects, the system would need to establish a correspondence between the
objects of the two results. Without knowing which reference standard object
refers to the tracked object, the system would have no way of measuring
accuracy. Solving this correspondence problem would be akin to solving the
original temporal correspondence problem that the tracker is trying to resolve.
To solve these difficulties, a new approach is considered. Returning to the
functioning of the object tracker, it seeks to find the ‘best’ match matrix for
a given scene. A reference standard could simply provide a reference standard
match matrix for each frame of a video sequence. The modules of the
tracker system can then be re-used to resolve conflicts (merges/partitions)
and generate the required features of the objects just as the tracker would.
Correspondence between tracker objects and reference objects would be au-
tomatically assured thanks to the match matrix framework (this will be ex-
plained in depth later). However, there are several issues which have to be
resolved for this method to be effective. The first is that for this technique
to work the silhouettes present in the reference standard have to be identi-
cal to those detected by the tracker. Clearly, to provide identical input, the
video sequences are captured for offline use. Three sequences of 1.5 hours
each were captured for the creation of the reference standard. In order to
save space, they are compressed using the lossless PNG compression format.
Lossy compression was found to lead to poor segmentation. Identical input
is not the only issue to creating identical silhouettes, however. Though the
background differencing algorithms are identical in both the tracker program
and the reference standard creation program, different tracking can cause
the reference image to change due to the ‘parked car’ feedback (as described
in section 3.2.3) – producing different difference images. What is needed,
therefore, is a ‘reference standard’ difference image to guarantee that the
difference images used by both the tracker and the reference standard are
identical. To do this, the tracker was adapted to save the difference images
as numbered PNGs – one for each frame of video. The high-level feedback
criteria described in section 3.2.3 were used to determine when an object
should be subsumed into the background. Left to do this automatically, the
system can occasionally incorporate objects too early or too late. User input
was therefore used to determine if feedback is truly necessary. A message
box quizzes the user when the system expects an incorporation, and three
boxes at the bottom of the GUI allow the user to input frame numbers when
this should occur early.

Figure 4.1: The main tracker window, asking the user whether to incorporate
an object into the background. Three input boxes allow the user to input
frame numbers where this should occur early.

The resulting difference images are saved to disk in sequence. These
can then be loaded by the tracker and reference standard creation program
instead of performing background differencing. As well as providing a basis
for the reference standard, this technique of ‘caching’ the difference images
dramatically improves the execution time of the algorithms. An example of
a reference standard difference image is shown in figure 4.2.
The reference standard creation program uses the tracker source code as
a basis, sharing the functions that are common to both. A screenshot of the
reference standard program is shown in figure 4.3.

Figure 4.2: An example of a reference standard difference image. In this
case, frame 307 of the sequence known as ‘seq003’

At a given frame, the reference standard program displays and labels
each silhouette on the difference image. The input video (as well as other
intermediary images) can also be viewed via the View menu. A simple greedy
algorithm, together with the Owens cost function quickly calculates what it
believes is the best match matrix and the resulting MBRs, silhouettes and
object paths. This match matrix is displayed at the top right of the GUI. By
clicking on an entry in the match matrix, an operator can flip its value. The
changes on the resulting objects are displayed immediately within the GUI.
Additionally, the type of object is displayed to the right of the match matrix
and can also be changed with a single click. The ‘silhouette inspector’ and
‘object inspector’ were designed for debugging and experimentation. They
allow objects and silhouettes to be drawn onscreen, costs between them to
be assessed and merges to be performed. A label above the match matrix
indicates the global cost of a given match.
Generally, the greedy algorithm finds the best match matrix and object
types. If an error exists, the operator can correct it and move on to the next
frame. At this point, the best match matrix and the features of the resultant
objects are saved on disk for this frame.

Figure 4.3: The reference standard creation program

Even with the help of the greedy
algorithm this is an extremely time consuming process. Assuming a single
frame takes only 4 seconds on average to process, the reference standard for
a sequence of 1.5 hours (at 4 Hertz) will take roughly 24 continuous hours
to produce. The GUI was carefully designed to speed development, with
frequently used buttons appearing near one another and a system to skip
frames devoid of objects automatically.
With a reference standard complete, all that remains is to devise clear,
simple and truly representative methods and metrics for comparison between
the reference standard and tracker results. Unfortunately, the correspondence
problem mentioned earlier still remains.
To illustrate this, consider that the results of the reference standard and
tracker are being compared at frame f . A match matrix corresponds to a
matching between objects and silhouettes. The silhouettes in both cases are
guaranteed to be identical, thanks to the differencing reference standard.
The objects are those which were tracked in frame f − 1, and there may
be differences between the tracker and reference standard. The order of
the objects could be different, a transient noise object may be present on
the tracker and a real object could be missing. There is no clear way of
knowing whether the transient object corresponds to the ‘lost’ real object,
or is simply a transient object. Also, two poorly tracked proximate objects
could be confused for one another. This correspondence problem is solved by
‘loading’ the reference standard objects from frame f − 1, before proceeding
with the tracking algorithm for frame f . This now ensures that objects
and silhouettes are identical for each frame, so that match matrices can be
compared and object correspondence is now automatically established as the
position in the match matrix. For example, if object A is present at f − 1, its
position at frame f in the reference standard can be compared directly with
the tracker’s. It effectively ‘resets’ the tracker to the reference standard at
each frame, and metrics can be measured as to how far it deviates from the
reference standard over the space of this frame. This mode of functioning of
the tracker is known as evaluation mode.
An object is defined as being matched in a given frame if it has been
matched to one or more silhouette(s) in the match matrix. Given an object
A from frame f − 1, there are four possibilities of differences between the
reference standard and tracker results – which affect the action taken. These
are shown in table 4.1.
‘Lost’ objects refer to objects which were matched in the reference stan-
dard but were lost by the tracker. ‘Extra’ objects are objects which have

Reference Standard Tracker Result
Matched Matched Distance between objects measured
Matched Unmatched Object is flagged as ‘lost’
Unmatched Matched Object is flagged as ‘extra’
Unmatched Unmatched Correct tracking. No stats taken

Table 4.1: Table of all 4 possible tracker/reference standard match combinations

disappeared in the reference standard, but were matched to a silhouette in
the tracker. These are often the results of nearby transient noise that was
a close enough match to fool the tracker into believing the object was still
present. For a given sequence of frames, the number of extra and lost objects
provide a useful measure of the accuracy of the tracker.
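The four outcomes of table 4.1 amount to a simple per-object decision, which can be sketched as follows. The function name and the use of plain Euclidean centroid distance are assumptions for illustration; only the four-way logic comes from the table.

```python
def evaluate_object(ref_matched, trk_matched, ref_centroid=None, trk_centroid=None):
    """Classify one object for one frame, following table 4.1.

    Returns ('distance', d) when both are matched, 'lost' when only the
    reference standard matched it, 'extra' when only the tracker did, or
    None when neither did (correct tracking, no stats taken).
    """
    if ref_matched and trk_matched:
        dx = trk_centroid[0] - ref_centroid[0]
        dy = trk_centroid[1] - ref_centroid[1]
        return ("distance", (dx * dx + dy * dy) ** 0.5)
    if ref_matched:
        return "lost"
    if trk_matched:
        return "extra"
    return None
```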
Both ‘lost’ and ‘extra’ objects operate relative to a reference object which
was present on the previous frame. Some spurious objects are created where
no reference standard object existed in the previous frame. In an empty
scene, for example, camera judder can cause the creation of these ‘orphan’
objects – so named because they have no parent reference standard object
to which to refer. An orphan object is defined as an object newly created
by the tracker in the current frame, and to which no newly created refer-
ence standard object’s centroid is an exact match. Poor tracking causes an
increase in the number of orphan objects in a sequence, since poor tracking
can incorrectly leave large silhouettes unmatched – which in turn generate
orphan objects.
One of the most useful metrics is the measure of the distance between
reference objects and tracker objects. These are arranged into bins, at inter-
vals of 5 pixels, for statistical simplicity. A second and final measure of the

quality of a match is measured in ‘flips’. This is the number of match matrix
entries which differ from the reference standard, for a given object. The more
flips, the worse the match. In terms of features this is a fairly rough measure,
since a single flip’s impact on the final object’s features can differ greatly.
For example a match could be missing only a tiny, insignificant silhouette; or
it could be missing its main body. Both of those situations result in a single
flip. It does however provide an insight as to how close the search function
came to the true match matrix, and therefore to the reference standard.
In summary, the following statistics are gathered:

• Number of ‘lost’ objects

• Number of ‘extra’ objects

• Number of ‘orphan’ objects

• Distance of tracker object match from the reference standard

• Number of flips of tracker match matrix from the reference standard
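The two match-quality statistics above can be sketched directly: distances fall into 5-pixel bins, and flips count the match matrix entries that differ for a given object. The row representation (one list entry per silhouette) is an assumption for illustration.

```python
def distance_bin(d, width=5):
    """Bin index for a centroid distance, at 5-pixel intervals.

    e.g. distances 0-5 px fall in bin 0, 5-10 px in bin 1, and so on.
    """
    return int(d // width)

def count_flips(ref_row, trk_row):
    """Number of match matrix entries that differ from the reference
    standard for one object (one row of the match matrix)."""
    return sum(1 for r, t in zip(ref_row, trk_row) if r != t)

# A tracker match 7 pixels from the reference lands in bin 1, and a row
# differing in a single silhouette entry scores a single flip.
```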

4.1.2 Potential Deficiencies

There are several deficiencies in both the method of defining a reference stan-
dard, and in the statistical metrics. The goal of these statistics is to mea-
sure the performance of the tracker as a whole, and in particular the object
matching module. It makes no attempt to measure the quality of the object
segmentation step, yet even the reference standard is heavily influenced by
its quality. This can lead to situations in which the reference standard is
sub-optimal – occasionally to the point of being unusable. Figure 4.4 is an
example of one of these situations, where a pedestrian standing by their car
has been poorly segmented. This appears to be due to his/her own shadow
or reflections on the car bodywork.

Figure 4.4: A poorly segmented pedestrian. Whichever match matrix is
chosen for the reference standard, the result will always be unsatisfactory

Similar situations occur when partitioning fails, such as during heavy

occlusion as seen previously in figure 3.20. These frames are therefore ignored
with the help of an ‘ignore list’, a small text file associated with each sequence
consisting of several pairs of ranges defining which frames to ignore when
reading the reference standard. As few frames as possible are ignored – it
is used only when segmentation failure is severe. Cases where the feature
distortion is reasonably minor are kept in the reference standard. If all
such cases were discarded, the reference standard (and therefore the SOM
training) could become slanted towards representing only well segmented
objects. However, the SOMs trained on this data need to model both well and
poorly segmented objects to ensure that poorly segmented object matchings
are not considered ‘novel’ when tested on new data. For example, pedestrians
exiting their cars will be poorly segmented due to the car door being included
as foreground. Features such as width and area are skewed by this. Yet these
remain in the reference standard to allow the SOMs to model this normal
variation.
Here is a sample ‘ignore list.txt’ (this one taken from ‘seq003’, the selection set):

9 -> Number of entries
99,210 -> 1st range
3289,3297 -> 2nd range
4025,4057 -> 3rd range
4586,4594 -> 4th range
4854,4885 -> 5th range
8451,8460 -> 6th range
11050,11090 -> 7th range
15809,15886 -> 8th range
17146,17146 -> 9th range

When the tracker reaches a frame equal to the first frame number of any
pair, it immediately skips to the frame after that pair's second frame number.
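A minimal parser for this format can be sketched as follows, assuming the file itself holds only the numbers (the '->' annotations above being commentary). The function names are hypothetical.

```python
def load_ignore_list(lines):
    """Parse an ignore list: a count line, then 'first,last' frame ranges."""
    count = int(lines[0])
    return [tuple(int(v) for v in line.split(","))
            for line in lines[1:1 + count]]

def next_frame(frame, ranges):
    """When a frame opens an ignored range, skip to one past its end;
    otherwise process the frame as normal."""
    for first, last in ranges:
        if frame == first:
            return last + 1
    return frame

ranges = load_ignore_list(["2", "99,210", "3289,3297"])
```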
One of the features of the method of evaluation is that the system mea-
sures accuracy only frame by frame – it resets to the reference standard every
frame rather than allowing the tracker to err for several frames. Whilst this
is a reasonable measure of the performance of the tracker, it does not mea-
sure its ability to recover from errors. Also, the only object feature measured
is the distance from the reference standard centroid. Other feature differ-
ences, such as area, height and width, which could have a large effect on
future matchings, are not measured. Therefore, the tracker could match a
silhouette whose position is close, but all other features are totally incorrect.
According to the metrics, the match would still appear to be a good one but
would ultimately result in a poor match in later frames if the system were not
resetting to the reference standard. To address this, a measure of the other
features could be taken. However, this would add a great deal of complex-
ity if each feature difference were measured. Any attempt to condense this
feature difference into a single metric throws up the same issues as the cost
function that we are trying to analyse. Instead, qualitative analyses of the
tracker’s performance, not in evaluation mode, were made to ensure that the
tracker’s performance statistics were not skewed. It was found that where
errors are made by the tracker, these are soon corrected. Generally, using
the object’s centroid alone as a measure appears to be very accurate. If the
centroid is correct, generally the other features are also very similar since
it implies that the same silhouettes were matched. Also, the ‘flips’ measure
provides a complementary method of measuring ‘closeness’ to the reference
standard.
Another issue, though through no fault of the reference standard, is that
of the computational time taken to perform these evaluations. Whilst assess-
ment of the greedy algorithm based tracker can be performed offline faster
than real-time, the exhaustive algorithm in one case requires up to a week of
processing time. Clearly, in a practical environment where software tweaks
must be made and the tests run several times this is a severe handicap. As
previously mentioned, the complexity of a given frame using the exhaustive
algorithm is 2^m – where m is the number of ‘1’s in the valid match matrix.
A complexity analysis of each scene, consisting of noting the complexity of
each frame, reveals that removing only a handful of frames can reduce the
running times of tests by several days. Despite the slight loss of data, a
pragmatic decision was made to remove these few frames to enable the test-
ing to proceed considerably more swiftly. Removing only two frames reduces
the computational load by a third. Many of the most complex frames are
within scenes which had already been removed due to very poor
segmentation. Only a single extra frame has been removed explicitly to reduce the
load. In total, this reduces the computational load (for seq003, the first of
the test sets) from 3,024,822 to 1,708,753 iterations. In total, 325 frames
were removed from a total of 25000. This example uses figures from seq003,
one of the three sequences. A similar ‘ignore list’ exists for each of the other
two sequences. ‘seq002’ is the sequence that was used to train the SOMs.
Sequences ‘seq003’ and ‘seq007’ make up the two test sets. Clearly, other se-
quences were available. These three sequences were chosen due to the good
conditions on the day. Other video sequences are marred by very high winds
(camera judder), raindrops on the camera, camera movements and unusual
lighting conditions such as fast lighting changes. Improving the tracking on
those sequences would involve an improvement in design of the camera and
stand, or of the object segmentation – neither of which is under considera-
tion in this section. However, some of those difficulties could be mitigated by
switching the object segmentation module to a mixture of gaussians based
system as discussed in section 3.2.1.
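The pruning decision above follows from the exponential per-frame cost: since each frame contributes 2^m iterations, a few dense frames dominate the total. The sketch below uses illustrative complexities, not the real seq003 data.

```python
def total_iterations(frame_complexities):
    """Total exhaustive-search cost: each frame contributes 2**m,
    where m is the number of '1's in its valid match matrix."""
    return sum(2 ** m for m in frame_complexities)

# Illustrative values of m for five frames: the two densest frames
# dominate the total, so dropping them yields a huge saving.
ms = [3, 4, 5, 20, 19]
full = total_iterations(ms)                  # 1,572,920 iterations
pruned = total_iterations(sorted(ms)[:-2])   # 56 iterations
```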

4.2 Statistical Analysis

This section begins by taking a closer look at the individual performance
of the SOMs which constitute the cost function, followed by an analysis of
the overall performance of the tracker. In particular, the performance of the
SOM-based cost function is compared to that of the Owens cost function,
and the quality of the greedy search algorithm is assessed.

4.2.1 Individual SOMs

As covered in section 3.3.2, the SOM-based cost function consists of three
SOMs: the motion SOM, the comparative SOM, and the appearance SOM.
In order to test the basic premise that the SOMs are able to learn to model
‘normal’ object motion, feature differences and appearance, each SOM is
presented with artificial situations. These situations help demonstrate the
SOMs suitability for the task, and – from a development perspective – are
an excellent way to understand the inner workings of the cost function and
why it performs so well.
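One standard way to turn a trained SOM into a novelty cost, and the way the tests below can be read, is to take the distance from the input vector to its best-matching unit: inputs close to the training distribution land near a codebook vector and score low. The toy codebook below is an assumption for illustration, standing in for a trained motion SOM; the thesis implementation may differ in detail.

```python
def som_novelty(feature_vec, codebook):
    """Novelty cost: Euclidean distance from the input vector to the
    best-matching unit (nearest codebook vector) of a trained SOM."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(dist(feature_vec, unit) for unit in codebook)

# Toy codebook: units clustered around 'southward' motion (positive dy,
# i.e. down the image), so such matches are cheap.
codebook = [(0.0, 1.0), (0.1, 0.9), (-0.1, 1.1)]
cheap = som_novelty((0.0, 1.0), codebook)    # common southward motion
costly = som_novelty((0.0, -1.0), codebook)  # unusual northward motion
```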
The first SOM is designed to assess the novelty of an object’s path. To
test the sensitivity of the system to usual and unusual paths, a normal ob-
ject trajectory was sampled from the reference standard and altered. The
object was matched to 40 different artificially generated silhouettes, each at
a different angle. If the SOM is to be a true reflection of the norm, common
directions of movement should yield a low cost. Shown in figure 4.5 are two
series, both depicting a pedestrian travelling down the scene. The graph
depicts cost versus the angle of the silhouette match for both pedestrians.
The angle is measured clockwise starting from the horizontal heading to
the right (‘east’). Example B appears to be almost ‘too perfect’ to be a true
result. It was the first result obtained, which naturally raised concern that
there might be errors in the testing methods. After a review of the testing
methods and many tests on different objects and areas of the screen, it was
confirmed to be a valid result; this type of sinusoidal output is not
particularly uncommon.
In both examples, pedestrians generally travel down the scene (‘south’)
so one should expect to see low costs around the π/2 radian mark. The true
paths taken on this frame by the pedestrians are marked by the vertical line
labelled ‘true angle’.

Figure 4.5: Performance of the motion SOM, given a specific point and speed
but different angles of motion.

Example B's sinusoidal shape might be explained by its
great similarity to some of the training cases. Of course these two examples
are lifted from sequences that were not used to train the SOM in order to
present a true test, they are taken from what will later be referred to as the
‘selection set’ ( ‘seq003’ ). In series B it is very uncommon for pedestrians to
travel upwards in the scene, and this is reflected in the single dip at roughly
π/2. Example A is taken from a less frequently visited part of the scene, less
well represented by the training data, and commonly has people travelling up
and down the scene at this position. The graph clearly has two dips, at π/2
and 3π/2.
It can be seen, however, that the cost for people travelling up the scene is
lower than down the scene. This is contrary to expectation, since one might

reasonably expect the cost function to be biased by the fact that the short
term memory window wt (dy) of the object indicates a ‘southerly’ direction.
The SOM should generally expect a continuity of direction – that is that
the direction of instantaneous motion should roughly agree with the recent
window generated by the window function wt . It is possible this is caused
by the simple fact that more pedestrians have travelled north in the training
data, and travelling south is considered slightly more unusual. Although this
type of testing is somewhat limited, it does reveal that the SOM functions
as intended.
As well as topologically learning the likelihoods of certain motions on the
scene, this SOM learns that some areas of the scene are simply inaccessible to
pedestrians and that matchings to these areas should be discouraged. These
areas might include busy road junctions, areas with obstacles or parking
areas. Even within relatively open areas, pedestrians tend to follow similar
well-defined paths. This tendency is one of the reasons that this SOM is able
to model the paths so well, and so accurately distinguish between normal
and abnormal movement.
The second SOM (known as the comparative SOM) is designed to roughly
model the function of the Owens cost function, yielding a measure of the
difference between two feature vectors with the ability to become biased in
different areas of the scene. Two vehicles are used in the following example.
Vehicle A is in the centre of the scene, ready to park. Vehicle B is entering
the scene from the south, and is only partially visible. To test the output of
the comparative SOM, the pArea input to the comparative SOM is altered
to values ranging from -2 to 2 and the output plotted onto a graph. All
other inputs to the SOM – X,Y, dX, dY, pHeight, pWidth, dHist – remain
identical to the true values for each vehicle. As one would expect, for vehicle

A the output is lowest where pArea is zero (no change in area). However,
for vehicle B the output is biased to expect slightly higher values of pArea,
since the SOM has learnt that the area of vehicles increases rapidly as they
enter a scene. This can be seen in figure 4.6.

Figure 4.6: Performance of the comparative SOM, testing the effect of chang-
ing pArea on the output cost. Vehicle A is in the centre of the scene, about
to park. Vehicle B is just entering the scene.

It is this ability to specialise to different areas of the screen that gives this
type of cost function such a clear advantage over a more basic cost function.
Here, only pArea is tested. Other tests conducted involving changing other
inputs, such as pHeight, pWidth and dHist have similar outcomes. The out-
put curve is often approximately a quadratic, or a triangular shape, similar
to that in figure 4.6.
The third SOM, the appearance SOM, is designed to provide a measure of
the normality of an object's appearance. Though initially counter-intuitive,
the appearance of objects is strongly affected by the position in the scene.
The simplest effect is that distance leads to smaller objects. More subtly, oc-
clusions between objects can cause a change of appearance. As objects enter
the scene, part of the object is occluded and the system learns to model this
appearance. To examine the SOM, 2 reference standard pedestrian matches
were tampered with by changing only the Aspect Ratio (AR) input, whilst
keeping X, Y, Area, Height and Width identical.

Figure 4.7: Performance of the appearance SOM, testing the effect of chang-
ing aspect ratio on the output cost. Pedestrian A has just exited his/her
vehicle. Pedestrian B is in an unobstructed area of the scene.

Pedestrian A has recently exited his/her vehicle, and is now standing
behind his/her vehicle so that only the head and shoulders are visible.

Figure 4.8: The two pedestrians used to capture the data for figure 4.7

This is often the case in this area of the scene, as drivers parking in this spot
will always exit the vehicle on the far side of the car. Looking at figure
4.7, the curve for pedestrian A shows that an aspect ratio close to 1:1 is not
unexpected and incurs a relatively low cost. Pedestrian B, on the other hand,
is in an open space and is clearly visible. This area is very rarely obstructed,
and all of the pedestrian is visible. In this situation, pedestrian B in figure
4.7 shows a clear preference for pedestrians that have an aspect ratio just
over 2.5.
Here aspect ratio was chosen, but a similar specificity to areas of the
scene is noticeable with all other input features. Once again, the SOM has
accurately learned the ‘normal’ features of objects in the scene.

4.2.2 Overall performance

Chapter 3 introduced the architecture of the tracking system, whilst
emphasizing its modular nature. The testing and statistical analyses presented
here make use of this modular design and the interchangeability of these
modules to assess the performance of the modules. There are two search
algorithm modules: the exhaustive search function, and the greedy search
function – both presented in section 3.3.5. There are also two cost functions
under examination: the Owens cost function, and the SOM cost function –

introduced in section 3.3.2. With these interchangeable modules, there are
therefore four potential combinations to test.
As mentioned earlier, the reference standard is constituted of three video
sequences of roughly 1.5 hours of footage each. The first was used to train
the SOM cost function, is referred to as the training set and plays no further
role in the testing of the system. The second (‘seq003’), was initially intended
as a test set. However, it was often used to test the performance of various
SOM configurations, greedy algorithm, and other modules. The very act of
perfecting the algorithm on a test set could clearly skew the results to suggest
a high-performance algorithm. Therefore, this set is henceforth referred to as
the selection set, with the final sequence (‘seq007’) considered the true test set.
It has been noted several times during testing that matching objects near
the edge of the scene can be considerably more difficult, due to being partly
out of shot and their features therefore being more changeable from frame to
frame. To confirm whether this causes a significant effect, distance and ‘flips’
statistics are gathered separately for objects touching the edges of the scene,
to those that are more central. Surprisingly, there is no noticeable effect,
with the greedy SOM-based tracker (the final design) matching 99.55% of all
edge objects to within 5 pixels of their true position, and 99.1% of central
objects to within 5 pixels (using ‘seq003’, the selection set). Knowing that
edge objects have similar accuracy is useful when considering the design of
the cost function, such as the possibility of a higher value for γ (the cost of
an unmatched object) for edge objects. These statistics help to confirm that
this is not necessary. Therefore this differentiation between edge objects and
central objects will no longer be made, in order not to obfuscate the issue.
The following statistical analyses are gathered using ‘seq007’, the test

set, with a total of 5785 reference standard object-frames. The graph in
figure 4.9 shows the results of tests on all four combinations of
tracker modules mentioned earlier. The distances of tracker matches from
the reference standard are measured in pixels, and arranged into bins at 5
pixel intervals. The Y axis is the percentage of all object frames which were
matched by the tracker to within the distance (from the reference standard)
marked on the X axis.
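The cumulative binning just described can be sketched as follows. This is a hypothetical illustration of the measurement, not the evaluation code itself, and the example distances are invented rather than taken from the real results:

```python
def cumulative_match_percentages(distances, bin_width=5, max_dist=30):
    """Percentage of object-frames matched within each distance bound.

    `distances` holds, for every reference standard object-frame, the
    pixel distance between the tracker's match centroid and the true
    centroid.
    """
    total = len(distances)
    bounds = range(0, max_dist + 1, bin_width)
    return {b: 100.0 * sum(1 for d in distances if d <= b) / total
            for b in bounds}

# Hypothetical distances for six object-frames:
percentages = cumulative_match_percentages([0, 0, 3, 4, 7, 16])
# percentages[0] is the share matched exactly; percentages[5] the share
# within 5 pixels, and so on up to max_dist.
```

Each bound on the X axis of figure 4.9 then corresponds to one entry of this cumulative distribution.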

Figure 4.9: The percentage of object matches within a distance X of the

reference standard, using the four combinations of cost and search function
modules in the test set.

From figure 4.9, it is clear that the greatest factor in performance is the
choice of cost function – with the SOM cost function clearly outperforming
the Owens function. The top two performing combinations are the greedy
and exhaustive SOM combinations, with very little performance differential
caused by the choice of exhaustive or greedy search function. (As mentioned
in section 3.3.5, the values of γ and δ are set to different optimal values
for the greedy and exhaustive modules.) Given that the exhaustive search
checks all possible configurations of the match matrix, and the greedy
examines only a small portion of it, one would expect the exhaustive function
to outperform the greedy algorithm. In designing the greedy function, the
aim is to dramatically cut the computation time whilst reaching a level of
performance that is as close as possible to the exhaustive algorithm. This
data suggests that the performance of the greedy algorithm is almost indis-
tinguishable from that of the exhaustive, with the greedy even showing a
slight edge over exhaustive at the 0 pixels (exact match) distance – using
both Owens and SOM cost functions. Whilst the greedy algorithm performs
best at very short distances, the exhaustive algorithm seems to be more adept
at ‘ruling out’ matches that are distant from the reference standard object.
This is most evident with the Owens cost function, where the exhaustive
algorithm tracks 99.72% of objects to within 15 pixels, compared to 98.83%
for the greedy algorithm. This may seem a minor difference, but it means
that over three times as many objects (1.17% versus 0.28% of object-frames)
are matched outside the 15 pixel range.
The average match distance gives a more succinct summary of the per-
formances based on distance, as shown in figure 4.10.
These results reinforce the difference in performance of the Owens cost
function in comparison to that of the SOM-based cost function. Again, the
greedy search module has a similar performance to that of the exhaustive –
particularly in the case of the SOM cost function. This could suggest that as
the cost function improves, the difference in performance between exhaustive
and greedy search function decreases. It is possible that this is because the
assumptions upon which the greedy algorithm are based may hold ‘more true’
as the cost function improves – such as the assumption made by the initial

Figure 4.10: The average match distances, comparing the Owens/SOM cost
functions, and exhaustive/greedy search modules in the test set.

Naïve match. As was mentioned in section 3.3.5, this initial step was found
to select a silhouette which is indeed part of the reference standard 98.7%
of the time for the SOM cost function, compared with 96.6% for the Owens
cost function. However, the assertions regarding the narrowing of the
performance difference between greedy and exhaustive as the cost function
improves must be made tentatively, in light of the relatively small amount
of supporting evidence.

The second measure of the similarity of the tracker match to that of the
reference standard, is the number of ‘flips’ – the number of entry differences
in the match matrix for each object. These results, seen in figure 4.11, are a
strong reflection of the results obtained using match distance in pixels. At a
distance of zero pixels, this is understandable since only a perfect silhouette
match (zero flips) is likely to produce an exact centroid match (zero pixels).

Figure 4.11: The percentage of object matches within X flips of the refer-
ence standard match matrix, using the four combinations of cost and search
function modules

Again, the major performance factor is the choice of cost function; the
greedy algorithm produces results that are similar to those of the exhaustive
module. Clearly, the SOM cost function outperforms the Owens cost function.

The number of ‘lost’ and ‘extra’ objects for all module combinations is
shown in figure 4.12.
In this case, both the choice of cost function and choice of search al-
gorithm have an impact in terms of ‘lost’ objects. The exhaustive/Owens
combination in fact slightly outperforms the greedy/SOM in this case. In
terms of ‘extra’ objects, the SOM-based cost function once again clearly out-
performs the Owens cost function. Interestingly, using the exhaustive search
function appears to slightly increase the number of ‘extra’ objects. This is

Figure 4.12: The number of ‘lost’ and ‘extra’ objects through the test se-
quence, using the four combinations of cost and search function modules

caused by the very nature of the exhaustive algorithm, with the problem be-
ing mitigated in the case of the greedy algorithm because it limits the search
to a sensible subset. A simple example shows how this situation can arise.
At time t − 1 two pedestrians are walking near the edge of the screen, as
two separate silhouettes, and are about to disappear offscreen. At time t,
only one of the pedestrians has disappeared. In this case, the greedy tracker
would match both objects to the silhouette – causing a match conflict. The
resulting macro object comparison would conclude that the cost of the macro
object match to the silhouette is higher than that of the single object match
– therefore the object would be lost correctly. In the case of the exhaus-
tive tracker, it may have discovered that partitioning the silhouette in two is
‘cheaper’ than paying the cost of γ added to the cost of the correct silhou-
ette match. This would result in the disappeared pedestrian being incorrectly
tracked when it should have been lost – causing an ‘extra’ object. This could

be considered a fault in the global cost model, or a strength of the greedy
algorithm’s macro object cost comparison system.
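The macro object comparison just described amounts to a three-way cost comparison: the merged (macro) match against either single-object match plus the γ penalty for the object left unmatched. The sketch below is illustrative only; the `cost`, `merge` and `gamma` interfaces are assumptions standing in for the system's actual machinery:

```python
# Sketch of the greedy tracker's conflict resolution, where two objects
# claim the same silhouette. `cost(obj, sil)` is an assumed matching
# cost, `merge(a, b)` an assumed helper that forms a macro object, and
# `gamma` the fixed cost of leaving an object unmatched (lost).

def resolve_conflict(obj_a, obj_b, sil, cost, merge, gamma):
    """Return the list of objects that keep the silhouette."""
    macro_cost = cost(merge(obj_a, obj_b), sil)
    single_a = cost(obj_a, sil) + gamma   # obj_b goes unmatched (lost)
    single_b = cost(obj_b, sil) + gamma   # obj_a goes unmatched (lost)
    best = min(macro_cost, single_a, single_b)
    if best == macro_cost:
        return [obj_a, obj_b]             # a true object merge
    return [obj_a] if best == single_a else [obj_b]
```

In the two-pedestrian example above, the macro match is costly (the silhouette resembles only one pedestrian), so the single match wins and the departed object is correctly lost.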
The final measure involves the number of ‘orphan’ objects – extra objects
which have no parent reference standard object – shown in figure 4.13. This
figure is high because ‘camera judder’ in heavy winds creates many large
noise silhouettes, and thus triggers the temporary instantiation of spurious
objects (probably around 400 of them). No matter how good the tracker is,
the vast majority of these ‘orphans’ will remain. These distant, temporary
orphans do no damage to the quality of the tracking; it is only those created
near true objects that cause a tracking disruption, and it is these which
better tracking is able to reduce.

Figure 4.13: The number of orphan objects through the test sequence, using
the four combinations of cost and search function modules

Once again, the SOM cost function clearly outperforms the Owens cost
function. The greedy algorithm performs well here, with only a slight drop
in performance in comparison with the exhaustive algorithm. Once again,
an improvement in the quality of the cost function appears to have narrowed

the gap in performance between greedy and exhaustive approaches.
Out of curiosity, the effect of δ (the cost of instantiating a new object)
on the cost function has also been measured statistically. This had already
been done extensively on the selection/training sets, in conjunction with
simply assessing the quality of tracking ‘by eye’, to set δ to its optimum
value. Testing on the test set can validate these results. Using the final
tracker (greedy SOM), setting δ to 0.0 yields a far higher incidence of
orphan objects: 457, compared to 411 with the standard settings. The average
match distance also rises from 0.218 to 0.247 pixels. These results appear
to justify the use of δ in the cost function as a computationally cheap way
of substantially improving tracking.
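The roles of γ and δ can be summarised by a global cost of the following general additive form. This is an illustrative simplification of the idea, not the system's exact implementation:

```python
def global_cost(match_costs, n_unmatched_objects, n_new_objects,
                gamma, delta):
    """Global cost of one candidate match matrix (illustrative sketch).

    `match_costs` are the individual object-silhouette match costs;
    gamma penalises each object left unmatched (lost), and delta
    penalises each silhouette instantiated as a new object.
    """
    return (sum(match_costs)
            + gamma * n_unmatched_objects
            + delta * n_new_objects)

# With delta set to 0, instantiating new objects is 'free', so noise
# silhouettes are more readily promoted to objects, consistent with
# the rise in orphan objects reported above.
example = global_cost([0.1, 0.3], 1, 2, gamma=0.5, delta=0.4)  # about 1.7
```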
So far, only the results from the test set have been reviewed. The results
from the selection set can of course also provide useful data, provided it is
borne in mind that the results may be slightly tainted by their use in selecting
the algorithms which now constitute the SOM cost module and the greedy
algorithm. All of the results derived from the selection set are extremely
similar to those obtained from the test set. The SOM cost function performs
extremely well, particularly in comparison to the Owens function; and the
greedy function produces results that are comparable with those of the
exhaustive function. In fact, even the greedy algorithm’s slight ‘edge’ over the exhaustive
function at a match distance of zero is seen in the selection set, as shown in
figure 4.14.
The results even appear to lend support to the theory that the perfor-
mance difference between greedy and exhaustive narrows as the cost function
improves, as can be seen in figure 4.15.
Finally, a graph of the extra/lost/orphan objects resulting from the four
configurations evaluated on the selection set is shown in figure 4.16. Note

Figure 4.14: The percentage of object matches within a distance X of the
reference standard, using the four combinations of cost and search function
modules in the selection set.

that here the number of orphans is considerably lower – this sequence was
subjected to considerably less camera judder than the test set. All of these
results support the conclusions drawn from the test set, including the result
showing that the exhaustive algorithm increases the number of ‘extra’ objects
in comparison to the greedy algorithm.
As covered in section 4.1.2, though all of the above statistics are an ex-
cellent estimate of real world performance, there is one factor which is not
directly measured by these statistics: the ability of the system to recover from
errors. This is due to the frame-by-frame nature of the evaluation system,
whose only major concern is the performance of the tracker in its normal
mode of operation; the ability to recover from errors has instead been
observed qualitatively throughout the development of the tracker system.
One of the most notable features of the SOM cost function is in fact its
ability to recover from errors. This has already

Figure 4.15: The average match distances, comparing the Owens/SOM cost
functions, and exhaustive/greedy search modules in the selection set.

been described in section 3.3.3, where it was noted that during extended
periods of partitioning the errors can mount up frame by frame, since the
partitioning imprecisions from the previous frame affect the object features
used in the current frame. However, the SOM cost function tends to redress
errors towards the standard appearance and movement of an object. This
is not the case for the Owens cost function. The same effect is visible after
tracking errors occur during normal operation.
As well as slight positional errors in tracking from frame to frame, the
initial instantiation of objects can occasionally cause problems. In the fol-
lowing example, a pedestrian standing behind a car has recently emerged
from partial occlusion. Due to his jacket colour matching the colour of the
car in the background, his legs are segmented individually. The tracker has
therefore continued to track the head and torso as was visible before, and
the legs have instantiated a new object. This frame, at time t, is shown in

Figure 4.16: The number of lost/extra/orphan objects in the selection set,
using the four combinations of cost and search function modules.

figure 4.17, with the input image on the left and the difference image on the
right.

The frame that follows this example (time t + 1) remains almost identical.
In the following two frames (t+2 and t+3) the fragmented silhouette merges
together again to form a single silhouette. The tracker awkwardly partitions
this silhouette in an attempt to keep track of the legs separately. At time t+4
the tracker loses tracking of the legs object, and segments the pedestrian as
a whole, as it should have been originally. The view at time t + 4 is shown in
figure 4.18.
This type of error recovery is quite typical, though it does not always
occur. Objects can occasionally remain fragmented for a while longer, though
very rarely for the entire duration of their existence. In the above case and
many cases in general, tracking is only slightly disturbed before the error is
corrected. In the final version of the tracker there is also a low minimum

Figure 4.17: A pedestrian emerging from partial occlusion is fragmented,
yielding a new orphan object created by the silhouette of the legs. The video
image with MBRs is on the left, with the difference image on the right.

Figure 4.18: Four frames after the example illustrated in figure 4.17, the
tracker has corrected the original segmentation error.

bound on the size of objects (50 pixels), below which they are culled. This
helps to remove objects which have clearly been poorly tracked.
The inverse situation to that just described can also occasionally occur.
Several objects standing close together can enter the scene as a single silhou-
ette. In this situation, they are tracked as a group. If the group separates,
the tracker will eventually track the objects individually. An example of
pedestrians tracked as a group is shown in figure 4.19.
Aside from the tracking performance, only one other small problem was
noted during qualitative evaluation. In general, the algorithm is able to
perform tracking in real-time with relative ease on a Pentium 4 (3.0 GHz)

Figure 4.19: A couple of pedestrians tracked as a group.

system. However, when several objects are being tracked and high winds
cause judder (as previously described), many silhouettes are created. In one
frame tested, 98 silhouettes were created with only 3 objects present
onscreen. This causes the system to freeze, taking several minutes to process
the frame. These camera judders tend to last only a single frame, and at most
perhaps three frames. By simply skipping any frame containing more than 30
silhouettes, the problem can be avoided; skipping such a small number of
frames has no serious effect on tracking.
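The frame-skipping guard just described could be as simple as the following sketch, where the `tracker_update` callback is a hypothetical stand-in for the real per-frame update step:

```python
MAX_SILHOUETTES = 30   # threshold taken from the judder observation above

def process_frame(silhouettes, tracker_update):
    """Skip frames flooded with noise silhouettes caused by judder.

    Returns True if the frame was processed, False if it was skipped;
    on a skipped frame the objects simply keep their previous state.
    """
    if len(silhouettes) > MAX_SILHOUETTES:
        return False               # judder frame: do not update objects
    tracker_update(silhouettes)    # normal tracking path
    return True
```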
Finally, it was noted during section 3.3.2 that the very design of the
SOM cost function could have a negative impact on the quality of the track-
ing during novel behaviour. Given that this tracker is designed specifically
with behaviour classification in mind, this was carefully investigated. By
definition, truly novel behaviour is rare – so an ‘actor’ was filmed exhibiting
strange behaviour in the car park. A snapshot of this roughly 40 minute
sequence is shown in figure 4.20.
The ‘actor’ walks across the car park at unusual angles, dashes between
cars and hides behind cars. Other suspicious activities include dawdling
around cars and visiting several cars in series in one trajectory. The suspi-

Figure 4.20: An ‘actor’, walks around the car park suspiciously without any
apparent ill effects on the quality of the tracking.

cious behaviour appears to have no ill effects on the quality of the tracking.
Unfortunately, the brightness of the actor’s trousers is almost identical to
that of the tarmac, causing a great deal of silhouette fragmentation. This
does not pose a significant problem to the tracker, which is generally able to
merge the fragments with no noticeable effect on the quality of the tracking.
The only minor difficulties encountered involve the repeated disappearance
and occlusion of the ‘actor’ behind cars, after which two separate silhouettes
sometimes emerge, causing a situation similar to that demonstrated earlier
in figure 4.17. This combined issue of poor segmentation and repeated occlusion
is itself unrelated to the issue of novelty being a threat to the quality
of tracking.
This chapter began by covering the methods used to generate the refer-
ence standard, and critically appraised the strengths and weaknesses of this
approach to statistical analysis. The general concepts of the SOM cost model
were then examined in greater depth, providing useful insight into its
behaviour and showing that the individual SOMs are indeed well able to
model the specificities of the scene they were trained on. A detailed
statistical analysis was then carried out in this section (4.2.2), which
demonstrates how effective the SOM cost function is – particularly in
comparison to the simpler Owens cost function. The greedy algorithm has
also been found to be
extremely effective, with results that are very similar to those produced by
the exhaustive approach – in some cases the results are even superior. This
was followed by a brief look at the issues which may not be fully covered by
the statistical analysis. In particular, the presence of suspicious activity does
not appear to have any negative impact on tracking.

Chapter 5


5.1 Objectives and Achievements

To paraphrase the objectives listed in section 1.3, the central objective was to
create a real-time tracker capable of robustly classifying and tracking pedes-
trians and vehicles on a static CCTV scene. Chapter 1 set out the context
within which this is to be done, that is to say that it must be within the
framework of a larger activity classifying system – such as that produced by
Owens. This requires that the centroids and class (pedestrian or vehicle)
of each object be tracked in real-time, which has clearly been
achieved. A survey of the relevant fields was covered in chapter 2, with a
critical appraisal of the techniques which might be applied to reach the ob-
jectives. Chapter 3 introduced the tracker architecture, breaking the system
down into several modules and introducing the concept of a search space and
a global cost function. Each module was then examined in turn, with de-
sign decision justifications and the potential shortcomings of each subsystem
reviewed. Amongst the modules, the novel SOM cost function is perhaps
the most important, presenting a new approach to measuring cost in the

field of tracking. Chapter 4 introduced the concept of the reference standard
evaluation, together with the results of this, in order to provide a statistical
backbone to the claim that the tracker is indeed robust (as required by the
objectives). It has been demonstrated that the SOM cost function is able to
model the specificities of a given scene, and provides results which are clearly
superior to a more traditional cost function (in this case the Owens cost func-
tion). The greedy algorithm was also shown to have a similar performance
to the simpler (but unfeasible in real-time) exhaustive search – in some cases
even exceeding its performance. Many of the concerns which may not have
been addressed by the statistical evaluation were then discussed, such as the
system’s ability to recover from errors and that the tracking quality does not
suffer from tracking suspicious objects. In summary of the statistical evalu-
ation, the final tracker (greedy SOM) tracks 99.27% of objects to within 5
pixels of their reference standard centroid, with a minimal number of spurious
objects being created and very few objects whose tracking is incorrectly lost.
The object classifier also enjoys a high level of accuracy, correctly classifying
99.2% of all objects.
Having quantitatively demonstrated the robustness of the system, all ob-
jectives outlined in section 1.3 have been met.

5.2 Further Work

Throughout the development of this system, it has been noted that this is
an extremely fertile area with many as yet unexplored avenues of
research. Starting with some of the most obvious, the object segmentation
module could be replaced with a mixture-of-Gaussians based system, in order
to make the system more resistant to waving foliage and camera judder.

As has been covered, this is only feasible in the presence of more powerful
or specialised hardware. The insertion of cars into the background could
be achieved using the method of replacing one of the Gaussians with one
centred around the mean of the new background pixel value at just above the
threshold level (as described by Harville). Another potential area of research
is the examination of objects as they enter the scene. At present, silhouettes
above a certain size are instantiated – this can have negative effects if an
object enters the scene already fragmented into several silhouettes. Some
form of appearance analysis as objects are instantiated could help mitigate
this effect.
Moving on to the tracking of objects, pyramidal optical flow (as described
in section 3.2.1) provides a way of tracking the motion of pixels across frames.
It may be possible to use this information to improve the matches between
objects and silhouettes. For example, three points on each silhouette consti-
tuting the object at time t − 1, could be computed forwards to time t. The
object could then be matched to all those silhouettes containing at least two
of those points. This technique could be tested in combination with the cost
function, or as a standalone technique. Whether this technique would be fast
enough to be computed in real-time is an open question.
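One possible shape for the proposed point-based matching is sketched below. The flow-propagated points are assumed to have been computed already (for example by the pyramidal Lucas-Kanade step of section 3.2.1); silhouettes are modelled here simply as bounding boxes, and all names are hypothetical:

```python
class Box:
    """A silhouette modelled as an axis-aligned bounding box."""
    def __init__(self, x0, y0, x1, y1):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1

    def contains(self, p):
        return self.x0 <= p[0] <= self.x1 and self.y0 <= p[1] <= self.y1

def match_by_points(propagated_points, silhouettes, min_hits=2):
    """Match an object to silhouettes containing enough of its points.

    `propagated_points` are the object's three tracked points computed
    forwards from time t-1 to t by optical flow; a silhouette is a
    candidate match if it contains at least `min_hits` of them.
    """
    matches = []
    for sil in silhouettes:
        hits = sum(1 for p in propagated_points if sil.contains(p))
        if hits >= min_hits:
            matches.append(sil)
    return matches
```

As suggested above, this filter could be used on its own or as a cheap pre-selection stage before the cost function is applied.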
An alternate approach to the search algorithm could make heavy use of
the macro object function. During testing, it was noted that this macro
object approach is extremely adept at identifying true object merges, and
at doing so with very little computational overhead. The ‘cheapness’ of this
function is due its approach of merging the object to assess cost (cheap),
rather than partitioning the silhouettes (expensive). An alternative search
function could build upon this, starting by merging all objects in the scene
and assessing the costs of all possible silhouette combinations. With the

silhouette subset found, one object could be removed and the new subset
found. By taking the difference of the subsets, the silhouettes constituting
the removed object might be found. By repeating this process iteratively, the
algorithm could produce an accurate object-silhouette matching relatively
cheaply (roughly below 2^#silhouettes × #objects operations). For the final
step, and for objects which are truly merged, the standard greedy algorithm
could be used.
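The iterative removal scheme just described might be sketched as follows; `best_silhouette_subset` and `merge_all` are assumed helpers standing in for the macro object machinery, not existing functions of the system:

```python
def attribute_by_removal(objects, silhouettes, best_silhouette_subset,
                         merge_all):
    """Attribute silhouettes to objects by iterative object removal.

    Find the cheapest silhouette subset for all objects merged
    together, then remove one object at a time; the set difference
    between successive subsets yields the removed object's silhouettes.
    """
    remaining = list(objects)
    current = best_silhouette_subset(merge_all(remaining), silhouettes)
    result = {}
    while len(remaining) > 1:
        obj = remaining.pop()
        without = best_silhouette_subset(merge_all(remaining), silhouettes)
        result[obj] = current - without   # silhouettes explained by obj
        current = without
    result[remaining[0]] = current
    return result
```

Each step removes one object and re-solves for the smaller macro object, so the cost of the cheap merge-based comparison is paid roughly once per object rather than once per match-matrix configuration.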
A rather more predictable improvement which could be made is the im-
plementation of a Kalman filter, not only on the centroid of each object, but
also to smooth other features from frame to frame. It is unclear whether this
would have a positive impact on tracking, but it would be a compelling area
of research.
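As a concrete illustration, a minimal one-dimensional Kalman filter for smoothing a single scalar feature (say, an object's area) from frame to frame might look like this. The constant-value process model and the noise parameters are illustrative assumptions, not tuned values:

```python
def kalman_smooth(measurements, q=1e-3, r=0.5):
    """Smooth a scalar feature track with a constant-value Kalman filter.

    q is the process noise (how fast the true value may drift) and r
    the measurement noise; both are illustrative guesses.
    """
    x, p = measurements[0], 1.0       # initial state and variance
    smoothed = [x]
    for z in measurements[1:]:
        p = p + q                     # predict: variance grows by q
        k = p / (p + r)               # Kalman gain
        x = x + k * (z - x)           # update towards measurement z
        p = (1 - k) * p
        smoothed.append(x)
    return smoothed

# A noisy but roughly constant feature is pulled towards its true value:
track = kalman_smooth([10.0, 12.0, 9.0, 11.0, 10.0])
```

The same recursion, applied per feature, would damp the frame-to-frame feature fluctuations discussed above, though whether that helps or hinders the cost function remains an open question.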
In order to further test the quality of the SOM cost function, an entirely
different scene could be used to train and test the SOM architecture.
Unfortunately, hand-marking images to produce the reference standard is an
extremely time-consuming process, which is why this system was not
tested on a second scene. One of the disadvantages of this system is that it
is necessary to train the system on reference standard data. However, fur-
ther work could investigate the possibility of training the SOM cost function
using ordinary tracking data (and the Owens cost function) – once pruned of
all merges/poor segmentation frames and so on. This data may prove to be
‘good enough’ to provide a solid cost function, and help alleviate the burden
of the necessity of a reference standard.
It is believed that this body of work provides an excellent basis for further
work, such as the avenues just described.


[1] J.L. Barron, D.J. Fleet, S.S. Beauchemin, and T.A. Burkitt. Perfor-
mance of optical flow techniques. CVPR, 92:236–242, 1994.

[2] A. Baumberg and D. Hogg. An adaptive eigenshape model. In Proc of

the 6th British Machine Vision Conference, Vol 1, pp 87-96, 1995.

[3] C. M. Bishop. Neural networks for pattern recognition - Ch. 4. Oxford

University Press, 1995.

[4] J.-Y. Bouguet. Pyramidal implementation of the Lucas-Kanade feature

tracker: Description of the algorithm. Technical report, Intel
Corporation Microprocessor Research Labs, 2000. Available in
the OpenCV documentation.

[5] B. Brown. CCTV in town centres: Three case studies. Home Office
Police Research Group: Crime Detection and Prevention Series, 1995.

[6] C. Donald (editor). Ergonomic considerations in CCTV viewing.

May 1998, Vol 4 No 3, 1998.

[7] X. Gao, T.E. Boult, F. Coetzee, and V. Ramesh. Error analysis of back-
ground adaption. In IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, pp503-510, Hilton Head Island, South
Carolina, June 13, 2000.

[8] I. Haritaoglu, D. Harwood, and L.S. Davis. W4: Who? When? Where?

What? A real time system for detecting and tracking people. In International
Conference on Face and Gesture Recognition, April 14-16, 1998,
Nara, Japan, 1998.

[9] M. Harville. A framework for high-level feedback to adaptive, per-pixel,

mixture-of-gaussian background models. In Proceedings of the 7th Euro-
pean Conference on Computer Vision, vol. 3, pp. 543-560, Copenhagen,
Denmark, May, 2002.

[10] M. Harville, G. Gordon, and J. Woodfill. Foreground segmentation using

adaptive mixture models in color and depth. In Proc. of IEEE Workshop
on Detection and Recognition of Events in Video, July, 2001.

[11] A. Hunter, J. Owens, and M. Carpenter. A neural system for automated

cctv surveillance. In IEE Symposium on Intelligent Distributed Surveil-
lance Systems, ed. S. Velastin, 26 Feb. 2003, IEE Savoy Place, London,
IEE London, ISSN 0963-3308, 2003.

[12] O. Javed, S. Khurram, and M. Shah. A hierarchical approach to robust

background subtraction using color and gradient information. In IEEE
Workshop on Motion and Video Computing, Orlando, Dec 5-6, 2002.

[13] O. Javed and M. Shah. Tracking and object classification for automated
surveillance. In Computer Vision - ECCV 2002, 7th European Conference
on Computer Vision, Copenhagen, Denmark, May 28-31, 2002.

[14] P. KaewTraKulPong and R. Bowden. An improved adaptive background

mixture model for real-time tracking with shadow detection, 2001.

[15] R. Kohavi and F. Provost. Machine Learning - Glossary of Terms, 30.

Kluwer Academic Publishers, 1998.

[16] D. Koller, J. Weber, and J. Malik. Robust multiple car tracking

with occlusion reasoning. In ECCV (1), pages 189–196, 1994.
[17] B.D. Lucas and T. Kanade. An iterative image registration technique

with an application to stereo vision. In IJCAI81, pages 674–679, 1981.

[18] P. S. Maybeck. Stochastic models, estimation, and control, volume 141

of Mathematics in Science and Engineering. Academic Press, Inc, 1979.

[19] Office of the Deputy Prime Minister. Urban white paper
implementation plan, 2001.

[20] J. Owens. Neural Networks for Video Surveillance. PhD thesis, The
University of Durham, 2002.

[21] C. Phillips. A review of CCTV evaluations: Crime reduction
effects and attitudes towards its use.

[22] C. Stauffer and W.E.L. Grimson. Learning patterns of activity using
real-time tracking. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(8):747–757, 2000.

[23] Nick Tilley. Understanding car parks, crime and CCTV: Evaluating lessons
from CCTV. Police Research Group Crime Prevention Unit Series Paper
No. 42, 1993.

[24] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: Prin-

ciples and practice of background maintenance. In ICCV (1), pages
255–261, 1999.

[25] E. Wallace and C. Diffley. CCTV: Making it work (section 2.6.4), 1998.

[26] G. Welch and G. Bishop. An introduction to the Kalman filter, 2001.
[27] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-

time tracking of the human body. In IEEE Transactions on Pattern
Analysis and Machine Intelligence Vol.19, number 7, 780-785 Aug, 1997.

[28] H. Zhao, R. Shibasaki, and N. Ishihara. Pedestrian tracking using single-

row laser range scanners, 2003.