
Tracking moving video objects using mean-shift algorithm

Vladimir Nedović, Artificial Intelligence Department, Faculty of Science, University of Amsterdam

Abstract

The implementation of kernel-based tracking of moving video objects [1], [2] based on the mean shift algorithm [4] is presented. We show that the algorithm performs exceptionally well on moving objects in various video sequences and that it is robust to changes in shape as well as to almost complete occlusion. We also propose possible extensions of the current implementation and future work that might be done in this area.

1. Introduction

The tracking of moving, non-rigid objects in video is an important and challenging task in computer vision and artificial intelligence with many applications, such as video surveillance (of humans and vehicles), traffic control, and sports video, as well as video summarization, compression, and multimedia mining. This report presents results obtained by implementing the state-of-the-art kernel-based tracking algorithm using mean shift that was introduced in [2]. In this approach, objects of interest are characterized by the probability density functions of their colour or texture features. By masking the distribution with a monotonically decreasing kernel, a spatially smooth similarity function is defined, and mean shift iterations can use the gradient of this similarity function as an indicator of the direction of the target's movement. The similarity is expressed in terms of the Bhattacharyya coefficient, which is argued [1], [2] to be much more suitable than many more commonly employed techniques, such as histogram intersection.

2. Related work

Tracking of moving objects in video sequences is a very active research area, and many approaches attempt to develop techniques that are robust to varying video conditions (such as partial or complete occlusion, clutter, and noise). For example, in [5] objects are tracked on the basis of texture and edge correlation criteria, and the motion model is rigid, affine, or homographic.
However, the correlation is not invariant to illumination, and Hager and Belhumeur [6] extend this approach with an SSD (sum of squared differences) tracking model that is insensitive to illumination changes; the model was improved further by the energy-based minimization procedure of [7], where the emphasis is on the object's shape and appearance, modeled with a colour texture map. A mixture of stable image structure together with an outlier process was presented in [8], while [9] takes a somewhat different approach, using an affine transformation based on planar regions. Tracking of people has emerged as an area in itself, and extensive research has been done in that field, as presented in many publications. However, because of real-time constraints, more complex techniques like covariance-scaled sampling [10] are usually replaced by simpler multi-class statistical colour and shape models [11] or adaptive mixtures [12].

In our project, we used the mean-shift algorithm, a non-parametric (i.e. kernel) density estimator that optimizes a smooth similarity function to find the direction of the target's movement. The similarity function is obtained by masking the object's colour distributions with an Epanechnikov kernel, and its smoothness (due to the uniformity of the kernel profile's gradient) allows the use of a gradient-optimization method. The algorithm thus focuses on a much smaller neighborhood and can outperform exhaustive search. We decided to use histograms as the representation of the objects' colour probability density functions (pdfs), as they satisfy the low-cost requirement of real-time tracking. In the following sections we give a brief overview of kernel-based techniques and the mean shift algorithm (details can be found in [1], [2]), as well as the details of our implementation.

3. The overview of kernel tracking and mean shift procedure

3.1 The kernel mask

As mentioned above, the object's density estimates (i.e. histograms) were weighted by a monotonically decreasing Epanechnikov kernel, given in the general d-dimensional case by

K_E(x) = (1/2) c_d^{-1} (d + 2) (1 - ||x||^2)   if ||x|| < 1, and 0 otherwise,

where c_d is the volume of the unit d-dimensional sphere and x are the normalized pixel coordinates within the target, relative to the center (i.e. ||x||^2 is the squared Euclidean distance of each pixel from the center of the target; see the figure). Since we were dealing with a two-dimensional image space (d = 2, c_2 = π), our kernel function was of the form

K_E(x) = (2/π) (1 - ||x||^2)   if ||x|| < 1, and 0 otherwise.   (1)
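To make the kernel concrete, the two-dimensional case (c_2 = π, the area of the unit disc) can be sketched as follows. This is an illustrative Python snippet, not the report's MATLAB code; the function name is ours:

```python
import math

def epanechnikov_2d(x, y):
    """2D Epanechnikov kernel: K_E = (2/pi) * (1 - ||x||^2) inside the
    unit disc and 0 outside.  (x, y) are pixel coordinates already
    normalized by the bandwidth, so the target region maps to the
    unit circle."""
    r2 = x * x + y * y
    if r2 >= 1.0:
        return 0.0
    return (2.0 / math.pi) * (1.0 - r2)
```

Pixels at the center receive the largest weight, 2/π, and the weight falls off quadratically to 0 at the boundary of the unit disc.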



The rationale for using a kernel to assign smaller weights to pixels farther from the center is that those pixels are the least reliable, since they are the ones most affected by occlusion or interference from the background (i.e. colour bleeding due to low resolution, transmission errors, and noise). A kernel with the Epanechnikov profile was essential for the derivation of the smooth similarity function between the distributions, since its derivative is constant; the kernel masking thus led to a function suitable for gradient optimization, which gave us the direction of the target's movement. The search for the matching target candidate was in that case restricted to a much smaller area and was therefore much faster than exhaustive search.

3.2 Distance minimization

Based on the fact that the probability of classification error is directly related to the similarity of the two distributions, the similarity measure in [1] was chosen so as to maximize the Bayes error arising from the comparison of target and candidate pdfs. Being closely related to the Bayes error, the Bhattacharyya coefficient was chosen, and its maximum searched for to estimate the target localization. The Bhattacharyya coefficient of two statistical distributions is defined as

ρ[p(y), q] = ∫ √( p_z(y) q_z ) dz
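For discrete histograms the integral becomes a sum over bins. A minimal Python sketch of this computation (a hypothetical helper, not the report's code):

```python
import math

def bhattacharyya(p, q):
    """Discrete Bhattacharyya coefficient rho = sum_u sqrt(p_u * q_u)
    for two normalised histograms p and q (each summing to 1).
    rho = 1 for identical distributions, 0 for non-overlapping ones."""
    return sum(math.sqrt(pu * qu) for pu, qu in zip(p, q))
```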


and in our case the distributions p and q are the histograms of the candidate and the target, respectively.

3.3 Mean shift

In order to find the best match for our target in subsequent frames, we needed to maximize the Bhattacharyya coefficient, which means (as explained in [1]) that we needed to maximize the term

Σ_{i=1}^{n} w_i k( ||(y − x_i)/h||^2 )


where h is the kernel's smoothing parameter, or bandwidth, k is the kernel profile, and the weight w_i is given by

w_i = Σ_{u=1}^{m} √( q_u / p_u(y_0) ) δ[b(x_i) − u]


where δ is the Kronecker delta function and b(x_i) is the index of the histogram bin to which pixel x_i of the candidate object belongs; the delta term is equal to 1 only for that particular bin u, and 0 otherwise. The terms q_u and p_u are the values of the target and candidate histograms at bin u. The mapping of colour values given by the weights w_i can be visualized to demonstrate how the target object changes over time and what the corresponding distribution of weights is; the figures below show grayscale images of some of these mappings, taken from the football sequence:
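Because of the Kronecker delta, each pixel's weight reduces to √(q_u / p_u(y_0)) for the single bin u = b(x_i) that the pixel falls into. A hedged Python sketch of this step (the function name and interface are ours, not from [1]):

```python
import math

def meanshift_weights(pixel_bins, q, p):
    """For each pixel, w_i = sqrt(q[b] / p[b]) where b = b(x_i) is the
    pixel's histogram bin, q is the target histogram, and p is the
    candidate histogram at the current location y_0.  Pixels falling
    into empty candidate bins get weight 0."""
    return [math.sqrt(q[b] / p[b]) if p[b] > 0 else 0.0 for b in pixel_bins]
```

When target and candidate histograms agree, all weights are 1; bins over-represented in the candidate relative to the target pull weights below 1, and under-represented bins push them above 1.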

Figure 2: The weights w_i displayed as images

Note that the sum above is actually a density estimate (i.e. a histogram) of the object centered at y in the current frame, computed with the kernel profile k(x) and weighted by w_i. The maximum of this density in the local neighborhood (starting from the last known position of the target) gives us the most probable target position in the current frame, and it can be found by employing the mean shift procedure. During this procedure, the center of the target candidate is successively shifted to

y_1 = ( Σ_{i=1}^{n} x_i w_i g( ||(y_0 − x_i)/h||^2 ) ) / ( Σ_{i=1}^{n} w_i g( ||(y_0 − x_i)/h||^2 ) )

where y_0 is the current location of the candidate center and g(x) = −k′(x) is the derivative of the kernel profile. Since the derivative of the Epanechnikov kernel profile is constant, the above expression reduces to the weighted average of pixel positions

y_1 = Σ_{i=1}^{n} x_i w_i / Σ_{i=1}^{n} w_i
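With the Epanechnikov profile, this weighted average is all that is needed per iteration. An illustrative Python version (our own helper, not the report's MATLAB implementation):

```python
def meanshift_step(positions, weights):
    """New centre y_1 = sum_i w_i x_i / sum_i w_i.  With the Epanechnikov
    kernel, g() is constant and cancels from the ratio, leaving a plain
    weighted mean of pixel positions.
    positions: list of (x, y) pixel coordinates; weights: list of w_i."""
    wsum = sum(weights)
    x1 = sum(w * x for (x, _), w in zip(positions, weights)) / wsum
    y1 = sum(w * y for (_, y), w in zip(positions, weights)) / wsum
    return (x1, y1)
```

With uniform weights this is just the centroid of the region; larger weights pull the new centre toward pixels whose colours are over-represented in the target relative to the current candidate.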

The details of the mean-shift procedure are outlined in [1] and [2]. Here we refer only to the implementation choices that we made and the results that we obtained.

4 Implementation details

4.1 The colour model

Since one of the biggest issues in visual tracking is the robustness of the algorithm under changing video conditions, including illumination and shape changes, the first problem we addressed was the choice of the colour space in which our algorithm would operate. We needed a colour model that is invariant to illumination changes and to changes of the object's shape, both of which are present in most video sequences. The easiest alternative was the normalized rgb colour space, since it is invariant to viewpoint, illumination, and object shape changes [3]. For our task we could also have used hue and saturation, which are even less dependent on the changing conditions. However, since we decided to use colour histograms as the representation of the objects' colour probability density functions, the three features of the rgb colour space gave us better discriminating power with a three-dimensional histogram than hue and saturation would with a two-dimensional representation.

4.2 Binary masking

As in the related literature on tracking, we made the simplifying assumption that a segmentation module has detected and localized our object of interest in the first video frame, and that we know its position, shape, and dimensions exactly (in the first part of the semester we implemented algorithms for detection and localization using histogram back-projection, which is one of the methods that can be used for this purpose). Therefore, before starting the tracking, we used binary masks to extract our targets from the initial frames and compute their colour histograms, except in the case of the cow chase video, where a representative portion of the target object was chosen for histogram calculation. The reason for this was that the targets in our implementation were defined as rectangular regions bounded on each side by the extent of the object; since these regions contained a large number of black pixels after the mask had been applied (those not corresponding to our target), the value of the histogram bin corresponding to the black colour was reset to 0. In the cow chase example, however, the colour of the target object is black, and resetting the first bin to 0 led to a significant loss of information in the histogram; therefore, an alternative approach to histogram evaluation was chosen, which did not affect the performance of the algorithm.

4.3 Colour histograms

For our implementation, we closely followed the choices and the algorithm outlined in [1] and decided to use colour histograms in the normalized rgb model.
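The black-bin reset described above can be sketched as follows (a hypothetical Python helper; the report's implementation is in MATLAB):

```python
def zero_black_bin(hist):
    """Reset the bin that collects the masked-out background (all-black
    pixels) to 0 and renormalise the histogram.  hist maps (r, g, b)
    bin-index triples to counts; bin (0, 0, 0) is where the mask's
    black pixels land."""
    h = dict(hist)
    h[(0, 0, 0)] = 0.0
    total = sum(h.values())
    return {k: v / total for k, v in h.items()} if total > 0 else h
```

As noted above, this trick fails when the target itself is black, which is why the cow sequence required an alternative histogram evaluation.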
The input frames were first converted to the rgb space (since I = R + G + B, all pixel values were divided by the sum of their R, G, and B components), eliminating the intensity information from colour and thus eliminating the colour's dependency on it. Then a weighted 3D histogram of the three components was calculated (the number of bins in each dimension was restricted to 16, estimated to give enough discriminating power to the object's colour distribution). The weighting kernels were adapted to the size of the target through the choice of the smoothing parameter h, which normalizes the target's ellipsoidal region, as defined in the paper, to a unit circle (by dividing the coordinates of each distance independently by h_x and h_y). In our case, however, the target region was bounded by a rectangle, in which the maximum distance from the center (at the corners) is greater than half the height or half the width; therefore, we decided to use a bandwidth of (√2/2)·{height, width}, which places the corners of the rectangle on the unit circle. (The other alternative was to reset to 0 all the weights corresponding to pixels outside the unit circle.) The figures below show the projections of a weighted 3D histogram into a 2D space for all three combinations of colours, evaluated on the target from the football sequence (a 3D histogram could be represented as a cube, but we chose an alternative representation in 2D by summing over the third dimension in each case):

Figure 2: Projections of a 3D rgb colour histogram into a 2D space
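The conversion-and-binning step for a single pixel can be illustrated as follows (a Python sketch; the 16 bins per dimension follow the text, while the function name is our own):

```python
def rgb_bin(r, g, b, nbins=16):
    """Map an (R, G, B) pixel to its normalized-rgb 3D histogram bin.
    Dividing by the intensity I = R + G + B removes the dependence of
    colour on illumination intensity; each chromaticity in [0, 1] is
    then quantised into one of nbins bins (1.0 is clamped into the
    top bin)."""
    s = r + g + b
    if s == 0:
        return (0, 0, 0)  # pure black: assign to the first bin
    return tuple(min(int(c / s * nbins), nbins - 1) for c in (r, g, b))
```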

4.4 Mean shift implementation

Even though the authors of [1] mention that Step 5 (i.e. checking whether the estimated new position is an overshoot) is not needed, since in only 0.1% of cases the Bhattacharyya coefficient derived at the new location did not increase, we implemented the whole algorithm and observed that the loop in Step 5 is used often. This is probably because some of our targets do not move in every frame, so a new position does not necessarily yield a greater similarity coefficient. However, although we often have overshoots with our gradient estimator, the motion in both of our video sequences is at moments also very fast; we therefore chose a shift threshold of 3 pixels, estimated to be the best compromise given the unequal velocity of our targets.
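The per-frame loop, including the Step 5 overshoot check and the 3-pixel shift threshold, can be sketched generically. Here rho and propose stand in for the Bhattacharyya evaluation and the mean-shift estimate; they are assumptions of this sketch, not interfaces from the report:

```python
import math

def track_frame(rho, propose, y0, shift_eps=3.0, max_iter=20):
    """One frame of mean-shift tracking.
    rho(y):     similarity (Bhattacharyya coefficient) of the candidate
                centred at y = (x, y)
    propose(y): mean-shift estimate of the next centre (weighted mean)
    Step 5: while the similarity at the proposed centre y1 is lower than
    at y (an overshoot), move y1 halfway back toward y.  Iterate until
    the shift falls below shift_eps (3 pixels in our implementation)."""
    y = y0
    for _ in range(max_iter):
        y1 = propose(y)
        while rho(y1) < rho(y) and math.dist(y, y1) >= 0.5:
            y1 = ((y[0] + y1[0]) / 2.0, (y[1] + y1[1]) / 2.0)
        if math.dist(y, y1) < shift_eps:
            return y1
        y = y1
    return y
```

A large shift_eps terminates early on slow targets; a small one wastes iterations when the motion is fast, which is the trade-off behind the 3-pixel choice.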

5 Experimental results

We tested our algorithm on various portions of two video sequences, which we call the football sequence and the cow sequence. In all cases the algorithm proved to be very robust to various changes in the target's shape and size, as well as to partial occlusions. The resulting video clips, assembled from the frames on which the tracking was performed, can be found at http:\\\~vnedovic\meanShift.html. For the football sequence, the resulting video player.avi was obtained by running the algorithm over a portion of 100 frames which contained changes in the object's shape and size, as well as several partial occlusions of our player of interest. As can be seen from the video and the figures below, the algorithm is robust to all these conditions (the first figure shows the initial frame in which we started tracking our player of interest; the next two show characteristic frames in which a large amount of occlusion is present; and the fourth shows the last frame of the sequence):

Figure 3: The initial football frame, the frames of the first and second occlusion and the terminating frame

The green tail following the player marks his trajectory from the initial position; note that this curve connects the points of the player's trajectory in image coordinates, as the camera moves somewhat during the frame sequence (the same applies to the cow sequence below). For the cow sequence we obtained even better results, since the 190 frames on which we tested the algorithm contain the object in a plethora of different sizes and shapes, as well as portions in which the target object is almost completely occluded. Some of the results extracted from the video cow.avi are shown in the figures below. Note that in the middle figure even humans cannot easily distinguish the contours or the position of the target; moreover, another object of very similar colour is occluding the target, and the tracker could mistakenly replace the original object with this one. However, our algorithm manages to keep the cow as its target and to avoid these occlusion problems. The rightmost figure shows the terminating frame with the complete trajectory (again in frame coordinates).

Figure 4: The initial cow frame, the frame with almost complete occlusion, and the terminating frame

The results of tracking in both sequences and the performance of the algorithm in various difficult situations and under varying conditions can best be visualized by running their respective video clips from the URL address provided above.

The sequences were tested on a 1.5 GHz machine with 512 MB of memory: 100 football frames took 91 seconds (1.5 min) to process, while 190 cow frames took 136 seconds (2.3 min), which is satisfactory considering that both results were obtained within a MATLAB environment. It is assumed that a C implementation could produce much better running times.

6. Future work

Our implementation of the mean shift algorithm has proven to be robust to changes in shape and occlusion, to stop-and-go conditions, as well as to some changes in size. The changes in size that we encountered, however, were not drastic, and our algorithm worked even though we did not explicitly adjust the kernel or the other parameters for this situation. In fact, adapting the size of the kernel with a weighted sum of the previous and current sizes, as proposed in [1], gave worse results, and we chose not to include it in our implementation. The first extension of the algorithm would therefore be to find a good estimate of an adaptive scaling parameter to account for the object's changing dimensions. Another addition to the mean shift method would be to make the tracker insensitive to changes in camera viewpoint (i.e. different camera shots), in which the target object could reappear at a very different position and at a much different size than in the previous frames. To account for this situation, we could ask our segmentation module to localize the target in a new set of frames, or alternatively extract additional discriminative features of the target over time, in addition to the colour histograms. It is also possible to combine the feature histograms across different frames or situations (different camera shots) by selecting the most discriminative colour or texture feature (or set of features) for a particular situation, thus making the algorithm adaptive to changes in viewpoint and colour, as well as to noise degradation.
We could also try to deal with complete occlusions by setting a threshold on the similarity coefficient and waiting for a couple of frames until the target reappears and we again have a satisfactory degree of similarity. This introduces the issues of selecting an optimal threshold on the similarity measure and of selecting the right number of frames to skip (at the risk of losing the position of the target). To make the algorithm much more efficient, an implementation in C or C++ would be required, making the module truly applicable to real-time situations.

7. References

[1] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-Based Object Tracking," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, 2003.
[2] D. Comaniciu, V. Ramesh, and P. Meer, "Real-Time Tracking of Non-Rigid Objects Using Mean Shift," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR '00), vol. 2, 2000.
[3] Th. Gevers, "Color in Image Search Engines," survey on colour for image retrieval, in Multimedia Search, ed. M. Lew, Springer Verlag, 2001.
[4] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach Toward Feature Space Analysis," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, 2002.
[5] B. Bascle and R. Deriche, "Region Tracking Through Image Sequences," in Proc. 5th Intl. Conference on Computer Vision, Cambridge, MA, 1995.
[6] G. Hager and P. Belhumeur, "Real-Time Tracking of Image Regions with Changes in Geometry and Illumination," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Francisco, CA, 1996.
[7] S. Sclaroff and J. Isidoro, "Active Blobs," in Proc. 6th Intl. Conference on Computer Vision, Bombay, India, 1998.
[8] A. Jepson, D. Fleet, and T. El-Maraghi, "Robust Online Appearance Models for Visual Tracking," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Hawaii, vol. I, 2001.
[9] V. Ferrari, T. Tuytelaars, and L. Van Gool, "Real-Time Affine Region Tracking and Coplanar Grouping," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Hawaii, vol. II, 2001.
[10] C. Sminchisescu and B. Triggs, "Covariance Scaled Sampling for Monocular 3D Body Tracking," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Hawaii, vol. I, 2001.
[11] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland, "Pfinder: Real-Time Tracking of the Human Body," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, 1997.
[12] S. McKenna, Y. Raja, and S. Gong, "Tracking Colour Objects Using Adaptive Mixture Models," Image and Vision Computing Journal, vol. 17, 1999.