
IMAGE PROCESSING ON A MOBILE PLATFORM

A thesis submitted to the University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences

2009

By Samantha Patricia Bail
School of Computer Science

Contents
Abstract
Declaration
Copyright
Acknowledgements

1 Introduction
1.1 Description of the Project
1.2 Motivation
1.3 Main Objectives
1.4 Scope
1.5 Dissertation Overview

2 Project Background and Literature Review
2.1 Overview
2.2 Mobile Platforms
2.3 Mobile Phones as Assistive Devices
2.4 Image Processing and Object Detection
2.5 Related Work
2.6 Analysis of Methods for Object Detection
2.7 Factor Graph Belief Propagation
2.8 Chapter Summary

3 Application Design
3.1 Overview
3.2 Requirements Analysis
3.3 Software Architecture
3.4 Image Processing Methods and Algorithms
3.5 Training Images
3.6 Issues Affecting the System Performance
3.7 Chapter Summary

4 System Implementation
4.1 Overview
4.2 Implementation Tools
4.3 Image Capturing
4.4 Phase One: Feature Extraction
4.5 Phase Two: Object Recognition
4.6 Result Output
4.7 Optimisation for Symbian S60 devices
4.8 Chapter Summary

5 Testing
5.1 Overview
5.2 Description of the Testing Procedures
5.3 System Performance Evaluation
5.4 Chapter Summary

6 System Evaluation
6.1 Overview
6.2 Analysis of the Research Methodology
6.3 Review of the Project Plan
6.4 Improvements
6.5 Chapter Summary

7 Conclusion and Future Work
7.1 Project Summary
7.2 Future Work

Bibliography

A Listings

List of Figures

1.1 Two exit signs according to BS 5499-4
2.1 Worldwide smartphone sales to end users 2008
2.2 Example of a factor graph
3.1 Class diagram showing the organisation of the application classes
3.2 State diagram for the emergency exit sign recognition software
3.3 Sobel kernels used for horizontal and vertical derivatives
3.4 Four examples of emergency exit signs captured with a phone camera
4.1 Individual steps of edge detection
4.2 Two examples of binary sign templates

Abstract
Emergency exit signs are an indispensable part of any safety precautions for public buildings. In case of an emergency, they indicate safe escape routes and emergency doors, using an internationally recognizable sign: a green and white sign with icons showing a running person, a door, an arrow pointing in the direction of the escape route and the word "Exit" (or other words describing an emergency exit), in different combinations. These signs can be easily detected and interpreted by sighted people, but are unsuitable for visually impaired persons who cannot rely on visual indicators. This project deals with the issues of recognizing emergency exit signs with a mobile device. It describes the development of a piece of software that runs on a Symbian OS smartphone and can be used to detect emergency exit signs using the phone's camera. When a sign is detected, the device indicates this through an acoustic signal and, if an arrow is present on the sign, the software specifies the direction through text output. In order to achieve fast processing times, the study also deals with the low computing power of smartphones. The chosen approach is based on belief propagation on factor graphs, a method drawn from statistics, which is used in combination with other image processing tasks such as template matching. While the success of an efficient implementation depends strongly on the observance of necessary optimisations in both the choice of algorithms and coding practice, the general feasibility of image processing on the chosen mobile platform is demonstrated by this project.

Declaration
No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright
i. Copyright in text of this dissertation rests with the Author. Copies (by any process) either in full, or of extracts, may be made only in accordance with instructions given by the Author. Details may be obtained from the appropriate Graduate Office. This page must form part of any such copies made. Further copies (by any process) of copies made in accordance with such instructions may not be made without the permission (in writing) of the Author.

ii. The ownership of any intellectual property rights which may be described in this thesis is vested in the University of Manchester, subject to any prior agreement to the contrary, and may not be made available for use by third parties without the written permission of the University, which will prescribe the terms and conditions of any such agreement.

iii. Further information on the conditions under which disclosures and exploitation may take place is available from the Head of the School of Computer Science.

Acknowledgements
I would like to thank my supervisor Dr Tim Morris for his support and helpful guidance throughout all stages of the project, as well as Dr David Rydeheard, who would always provide me with good advice whenever I came across any difficulties on the course. Many thanks to Marcus Groeber for his advice regarding Symbian, and to Volodymyr Ivanchenko for providing the Crosswatch application for testing purposes. Thanks to my family and especially my mother and grandfather who supported me during my never-ending studies (I'm off to the next round). My thanks go out to Simon for his incredible patience, as well as his family for all their help. Thanks to all my friends in the UK and in Germany, especially to Dr B., and to my housemates for their motivational talks. I would also like to mention all the students who spent so many days (and nights) in the MSc lab and provided me with advice and chats. Danke.

Chapter 1 Introduction
1.1 Description of the Project

Visual signs provide a means of orientation for sighted people within unfamiliar locations such as offices, hospitals and other public buildings. Particularly in emergency situations, emergency exit signs point the way to important escape routes, thus making them a legal requirement for buildings of a certain size. However, for people with visual impairments, these vital resources cannot be utilized as a guidance aid. Using a mobile tool to detect these emergency signs and output the necessary information in acoustic form can make them accessible to people who cannot rely on their eyesight to recognize visual objects. This can be helpful in unknown or complex buildings, when the escape routes cannot be memorized and there is no other person immediately available who could provide guidance to find the right escape route. This project will carry out research into the feasibility of such a guidance system, analyse different methods to achieve the task and describe a way of implementing the system on a mobile platform. Upon completion of the work, we will have gained insights into an efficient implementation of computationally demanding procedures such as computer vision algorithms on mobile devices with low processing power. In addition, the software will be a demonstration of how modern technology such as the smartphone platform, with its wide scope of possible applications, can be used to assist blind and visually impaired people.


1.2 Motivation

There are over two million people in the UK living with significant sight loss, of which over 300,000 are officially registered as blind or partially sighted [RNI]. Numerous tools and techniques are available to blind people to help them complete everyday tasks more safely and with greater independence. Such assistance can come in the form of guide dogs and white canes (for navigating around unfamiliar obstacles in public spaces) but also in lesser known forms, an example of which are digital water level sensors that sound an alarm when a vessel is full. The use of modern information technology has become increasingly popular in the past few years, with companies providing mobile talking book players, braille output devices for mobile phones and text-to-speech software for computers. To sighted people, many everyday tasks such as locating exit signs in public places are hardly thought about; they are done almost subconsciously. However, for a blind or partially sighted person, not being able to identify the quickest and safest way out of a building can have serious, potentially dangerous consequences. It is this particular problem that will form the core of this study. While mainly based in the discipline of computer vision, this project has two important aspects: first, adapting modern technology in order to provide assistance to visually impaired people, without the need to produce specifically designed devices for them, which is connected to the notion of accessibility; secondly, the implementation of a computationally demanding task such as image processing on a platform with restricted computing power. The latter makes it necessary to move away from some of the traditionally used methods that prove too computationally demanding, and to explore novel approaches, simplified versions of algorithms and approximations that can be used to achieve a lightweight implementation. We acknowledge that the idea of using computer vision for visually impaired people is not groundbreaking; however, it is still rarely seen on mobile platforms. We hope to give an insight into the different possibilities that modern mobile systems offer, and provide the basis for further research in this area.


1.3 Main Objectives

The aim of this application is to provide visually impaired people with a method of recognizing emergency exits[1] independently, using an out-of-the-box mobile phone with a built-in camera. The ideal process when using the application would include the following steps:

- The user opens the application on the phone, if possible via a shortcut
- The user pans the phone from side to side
- If an emergency exit sign is detected, the application outputs an acoustic signal (a beep)
- If the sign contains an arrow, the application outputs the direction of the arrow (e.g. "Arrow points to the right")
- The user then knows the location of the sign and where to proceed from there (e.g. to the next door)

It is obvious that these signals can only function as a pointer to indicate the approximate direction of an emergency exit. Parameters like the location of the sign in the room (above a door, on the wall etc.) or the exact distance from the camera would make the application more useful, but are difficult to determine. However, a rough acoustic description of the direction is already one step ahead of signs that are otherwise virtually useless for visually impaired people. It can help the user decide, for example when standing in the middle of a corridor, in which direction to proceed to get closest to the nearest exit. By also describing the arrow on the sign (if present), walking in a direction other than that of the exit can be prevented, which makes this an important part of the application's output.

1.4 Scope

In order to achieve the previously mentioned objectives, the scope of the project has to be clearly defined.
[1] We have picked emergency exit signs for this task as an example of the technology used in the project, as they are easily recognisable and standardised. However, we would like to point out that the methods discussed in this study could be applied to any other type of sign that is based on a common standard.


First, it has to be specified what exactly should be recognized by the application. The basic design of emergency exit signs is similar in most countries, despite there being no mandatory international standard for emergency exit signs. Most signs include a stylised symbol of a running person (sometimes in front of a rectangle that represents a door), an arrow and the words "Exit", "Emergency exit", "Fire Exit" or similar, in various combinations, but always green[2] on a white background (or vice versa). Depending on the surrounding lighting conditions, the signs can be lit either from the inside or externally. These differences do not cause any problem for people who can see the signs and interpret them as similar on the basis of their typical colour and contents. However, when trying to apply automated recognition strategies to different types of signs, these are likely to fail. This is why a decision was made to constrain the set of signs that should be recognized to emergency exit signs designed according to the British Standard BS 5499-4 [BS500], which are widely used in most public buildings in the UK. Signs of this type are composed of up to three different parts:

- A running figure (running to the left or to the right)
- An arrow pointing in the direction of the escape route
- The word "Exit" or "Fire exit"

[2] In the case of BS 5499-4 exit signs, the shade of green is Pantone 3405 CVC.

Even with this constraint, we are still confronted with a problem caused by the often low quality of built-in mobile phone cameras. When trying to capture sample pictures of exit signs that were illuminated internally, the light intensity of the sign leads to a large white spot on the image. This overexposure makes recognising any sign impossible. Since it cannot be expected that mobile phone cameras have the means of automatically adjusting the exposure time to correct these flaws, this type of internally illuminated exit sign simply has to be removed from the set of recognizable images. This reduces the task to recognizing emergency signs that were designed according to BS 5499-4 and which are not internally illuminated. Two examples of these signs are shown in figure 1.1. The signs always consist of the same three parts; however, their layout differs depending on the location of the exit. Signs that point at a location to the right


Figure 1.1: Two exit signs according to BS 5499-4

(i.e. right, up, down, top right, bottom right) have the arrow on the right-hand side, with the running person facing right. Accordingly, all signs with an arrow pointing to the left, top left or bottom left place the arrow on the left-hand side, with the running icon also facing left.

1.5 Dissertation Overview

The structure of this dissertation is roughly based on the chronological development of the project. In chapter 2 we will discuss the usage of mobile phones as tools for visually impaired users. We will then give an overview of the domain of computer vision and its sub-areas that are relevant for the given task, such as image processing and object detection. This is followed by an extensive review of related work of a similar nature, i.e. image processing applications on mobile platforms, on which our project will be based. Chapter 3 will evaluate different mobile platforms with respect to their suitability for the task, and their efficiency in carrying out computationally demanding processes like object detection. After deciding on the platform, the available tools and methods will be reviewed, which will form the basis of the actual implementation of the software. We will then describe the general application design and outline implementation details, such as the image processing algorithms that will be used. In chapter 4, details of the implementation on the mobile platform will be explained, along with a discussion of the methods necessary for optimising system performance. Given the rather unfamiliar mobile platform and programming language Symbian C++, we will also include code snippets to describe the most important modules of the system and highlight significant details.


This is followed by a description of the testing procedures and an evaluation of the implemented system with respect to the test results in chapters 5 and 6 respectively. In addition, the chosen approach is analysed in the context of projects with a similar background, where some of its advantages and disadvantages will be discussed. In chapter 6 we will also review the project plan with respect to the project flow. In the final chapter, we will summarise the project with respect to the tasks performed during the course of this study and the findings discussed in the different chapters. The work will be concluded by an overview of possible future developments and applications based on the work performed in the course of this project.

Chapter 2 Project Background and Literature Review


2.1 Overview

This chapter will discuss the project background with regard to the status of the chosen platform and the foundations of the research area it is based on. This will be followed by an extensive literature review that discusses and analyses the research that was carried out in similar projects, and their approaches to the problem of image processing on mobile devices. We will then look into details of the most suitable methods for the given task and draw a conclusion regarding the chosen approach for our project.

2.2 Mobile Platforms

2.2.1 Hardware: Suitable Devices

Nowadays, the vast number of mobile phones that are available to the public offer a multitude of designs and functionalities. For this project, certain requirements for processing capacity and user interface have to be met, which narrows the choice of phones down to a certain type. The term smartphone is generally used for a mobile phone that combines standard phone functions (phone calls, text messaging) with those of a PDA, such as internet access, e-mail tasks, multimedia players and office applications [Yua05]. The most popular and established smartphones to date are the Blackberry line (RIM), the iPhone (Apple)


and several Nokia devices (such as the N-series), and the market is ever-growing. The average processing power of smartphones seems appropriate for the computationally demanding task of image processing, as has already been proved by several applications (see section 2.5). This is why we decided to choose a smartphone platform for this project rather than developing a Java application for a standard mobile phone. It can also be assumed that visually impaired users prefer to use devices with text-to-speech software, for which smartphones provide the most sophisticated platform. With respect to the hardware and user interface, several requirements have to be met for this task. The most obvious feature that is needed for image processing is an integrated camera which is suitable for capturing images in a sufficient quality and resolution, such as 320x240 or 640x480 pixels [KT07, ICS08]. Since most mobile phones and smartphones come equipped with a camera that has a resolution of at least 1 megapixel (up to 8 megapixels), this criterion will be easily met by most available devices. Another important issue is the user interface, that is, its accessibility by visually impaired users. As previously mentioned, it can be assumed that users access the device through text-to-speech software that reads out the screen content and describes the phone menus. Interaction is then carried out using the phone's buttons, which have to be felt out. This requirement rules out devices that are operated with touchscreens, as they provide no tactile feedback to the user[1]. In conclusion, the most suitable device for the given task is a smartphone that is able to run third-party software, comes equipped with a camera and has tactile buttons. These criteria have to be considered when choosing the platform for the image processing software.

2.2.2 Operating Systems and Platforms

Smartphones are currently distributed with a wide range of operating systems, such as Windows Mobile, the BlackBerry OS, Symbian OS, Palm OS and Linux-based systems. All systems offer different capabilities for installing and running
[1] Nokia announced support for tactile feedback on touchscreens with the latest Symbian OS version 9.4 in late 2008. However, this will not be discussed here, as it cannot be considered commercially relevant yet.

Operating System             Sales in Thousands    Market Share
Symbian                            72,933.5            52.4 %
RIM (BlackBerry)                   23,149.0            16.6 %
Microsoft Windows Mobile           16,498.1            11.8 %
Mac OS X (iPhone)                  11,417.5             8.2 %
Linux                              11,262.9             8.1 %
Palm OS                             2,507.2             1.8 %
Other                               1,519.7             1.1 %
Total in 2008                     139,287.9           100.0 %

Figure 2.1: Worldwide smartphone sales to end users 2008

third-party software, and most manufacturers provide APIs for various programming languages such as C, C++, Java and Python. With a market share of roughly 50% of the smartphone market, the Symbian operating system is currently the leading smartphone platform, as shown in figure 2.1 [Gar09]. Symbian OS is widely supported by Nokia[2] and encourages developers to implement applications for its operating system by providing the necessary APIs and tools. This, and the wide range of available Symbian handsets, makes it a suitable system to reach as many users as possible. In particular, the majority of accessible Symbian devices run the S60 version of this operating system. Based on previous works that used the Symbian platform to develop image processing applications, it can be assumed that devices with sufficient processing power for this task are available. While Symbian also supports Java and Python, Symbian C++, a C++ dialect, is labelled the fastest and most efficient programming language on this system. [ICS08] even states that for the task of recognising zebra crossings with a mobile phone, "Real-time performance [...] is made possible by coding in Symbian C++". These points led to the decision to develop the software for recognising emergency exit signs on the Symbian platform, using Symbian C++. The device used for this application is a Nokia N95 smartphone running the 9.2 version of Symbian OS. Both Symbian and Nokia offer extensive and up-to-date online resources with detailed information on Symbian C++ programming [Sym09, For]. The online forums offered on both websites in particular act as a helpful source, along with sample code, parts of which (e.g. tutorials for the camera API) were used as a starting point for this project.
[2] Symbian Software Limited was in fact acquired by Nokia in 2008.


2.3 Mobile Phones as Assistive Devices

Mobile phones, and smartphones in particular, provide a platform for a wide range of applications for visually impaired users. Interaction with the phone is made possible by a screen reader that outputs the displayed text via text-to-speech, and therefore allows access to text-based information such as phone menus, internet content or text messages. Many smartphone applications were developed specifically for use by visually impaired people. These include OCR[3] applications that make use of a built-in camera, navigation software and audio players for talking books in the DAISY[4] format. Due to the existence of a screen reader on the device, the applications do not need to implement their own text-to-speech solutions, but only to ensure accessibility by the screen reader. The benefits of smartphones as assistive devices lie in the convenience of using an out-of-the-box platform with additional software, rather than a hardware-based implementation that was designed for a single purpose. This includes the lower costs of an all-in-one tool, as opposed to multiple devices, the relatively small size of modern phones and the comfort of not having to carry several devices.

2.4 Image Processing and Object Detection

In order to produce a correctly and efficiently working piece of software, it is important to analyse the basic requirements for the given task with respect to the underlying principles of computer vision. This discipline, part of the domain of artificial intelligence (AI), aims at "emulat[ing] human vision with the aid of computers" [GW02]. The general description of this term is the detection and recognition of objects on the basis of an input image, which leads to a decision on the contents and nature of the object and, eventually, a reaction of the system. However, there is no clear definition of the boundaries of this discipline and those of the subareas it comprises. In the literature, it is often implied that, while being an extensive research subject itself, image processing is a subarea of computer vision. It deals with the processing and analysis of an image in order to manipulate the image, which yields another image (for example, by applying a masking algorithm to detect edges in an image), or to obtain information about
[3] Optical Character Recognition.
[4] Digital Accessible Information SYstem: a digital talking book format, based on XML applications, that was specifically developed for visually impaired users.


the image (such as a histogram of the image's tonal values). The area of object detection is of particular importance for the task of recognizing emergency exit signs. This term describes the process of finding and identifying an object of interest in a given image; in this case, a rectangular plate in a room or corridor. In order to recognize the object as an emergency exit sign (as opposed to other signs), and the direction it points at, the object has to be classified. This process, again, utilizes methods from AI (and neural networks in particular), such as prior training of the system (on a set of positive and negative images), probabilistic methods and statistics. As is typical for a research discipline that has been studied for several decades, there exists a large number of different algorithms (and implementations) for the various tasks of computer vision. Given the limited computing capacity of mobile phones, the suitability of different algorithms for efficient implementation on this platform has to be analysed. In the next section, we will look into projects that deal with similar problems and review the different methods, which will then provide a basis for our implementation.
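Before doing so, it is worth making the edge-detection example above concrete; the Sobel kernels it relies on are shown later in figure 3.3. The following is a minimal illustrative sketch in plain C++ of a Sobel gradient pass over a greyscale buffer; the function name and buffer layout are assumptions made for this sketch, not code from the implementation.

    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    // Illustrative sketch only: Sobel gradient magnitude on an 8-bit
    // greyscale image stored row-major in src (width * height pixels).
    std::vector<int> SobelMagnitude(const uint8_t* src, int width, int height)
    {
        // Horizontal and vertical Sobel kernels (cf. figure 3.3).
        static const int kx[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
        static const int ky[3][3] = { {-1,-2,-1}, { 0, 0, 0}, { 1, 2, 1} };

        std::vector<int> mag(width * height, 0);
        for (int y = 1; y < height - 1; ++y) {
            for (int x = 1; x < width - 1; ++x) {
                int gx = 0, gy = 0;
                for (int j = -1; j <= 1; ++j) {
                    for (int i = -1; i <= 1; ++i) {
                        const int p = src[(y + j) * width + (x + i)];
                        gx += kx[j + 1][i + 1] * p;
                        gy += ky[j + 1][i + 1] * p;
                    }
                }
                // |gx| + |gy| approximates the true gradient magnitude
                // without floating-point arithmetic, which matters on
                // phones without a hardware FPU (see section 2.5.2).
                mag[y * width + x] = std::abs(gx) + std::abs(gy);
            }
        }
        return mag;
    }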

2.5 Related Work: Image Processing on Mobile Platforms

2.5.1 Server-Client Systems

It is only recently that researchers started studying the issue of image processing on mobile devices. Due to the restricted computing capacity of mobile phones and PDAs, there are different approaches to dealing with this issue, of which the most significant will be discussed in this section. One of the early solutions is the use of a server-client based system. The user captures an image with the mobile device, which is then sent to a server that carries out the actual processing work. After processing, the reply is sent back to the user via the mobile phone network. Various commercial providers use this method for mobile marketing in high profile campaigns, such as presented in [Koo]. The major advantage is that the task can be carried out with any kind of mobile device that has a camera, no matter what computing capacity it has. However, this system requires a phone network connection to be available, which can be difficult in certain areas or inside buildings. [RR06] uses a similar


server-client solution to recognize street name plates and use them as links to further information that is available on-line. The system runs on a PDA with a touchscreen, which allows the user to manually highlight the area that contains the street sign (the area of interest), and therefore simplifies the issue of object detection in the image. Several feature extraction algorithms (SIFT, Black/White, Wavelet, HSV) were analysed for this task, with SIFT proving to be the most effective algorithm that is invariant to perspective bias and varying lighting conditions. Due to the server-client structure, however, there were no investigations into whether SIFT is also suitable on devices with low computational resources. In 2009, Nokia launched their Point & Find software, which is the first system that does not restrict the use to a certain type of objects [Kob09]. After capturing a picture of the object, such as a film poster, the image (and, if available, GPS information on the user's location) is sent to a server that searches a database. Additional content and information on the object is then sent back to the handset. Nokia aims at extending the use of Point & Find to a wide range of commercial applications, such as barcodes and museum exhibits.

2.5.2 On-device Image Processing

A different approach to those previously mentioned, which focuses on carrying out the image processing on-device, is the use of 2-dimensional barcodes or QR codes. With this method, the user captures an image of the barcode (e.g. on magazines or posters), which is processed immediately and turned into a URL that leads to further information on a website, accessed via the phone's browser. This established method has spread widely over the last few years (various examples can be found on [Mob]) and many phones already come equipped with a barcode reader [Nok]. A similar project (PhoneGuide) uses image processing on mobile devices to recognize exhibits in a museum [BB08]. A wide range of different algorithms have been examined for this purpose, such as pattern matching, discriminative regions and SIFT, which all proved too inefficient on computationally restricted devices. The chosen approach, a linear separation strategy implemented with an artificial neural network, achieved the most correct and efficient object recognition. Several sets of normalized features (such as colour and structural features) were tested for object recognition, with colour features yielding the highest recognition


rate. However, due to the differences between various mobile phone cameras, colour calibration is necessary if the camera used for training the algorithm differs from the user's phone. The application, implemented on a Symbian S60 smartphone in Symbian C++, achieved a recognition rate of 90% in tests, with processing times of less than 1 second. All previously mentioned systems assume that the user points the camera directly at (or even manually marks) the object that is to be captured, with blurring, varying lighting conditions, scaling and perspective bias being the major issues that need to be addressed. A basic aspect of our project, however, is the software's suitability for visually impaired people. In this case, not only the nature of the captured object is important, but detecting whether there is any object in the picture at all is even more critical. [PTAE09] emphasizes the importance of object detection as a first step to recognizing text on street name plates. The system uses a boosting algorithm (AdaBoost) and Haar features for object detection. In order to correct the number of false positives, the system makes use of the textural information on street name signs (as opposed to windows and building facades that caused the false positives). The text on the signs is then recognized using a direct matching technique. Given the limited set of street signs that are to be recognized, this image matching approach is considered more efficient than character recognition. Although the system is intended for use on a mobile phone, the testing was only carried out on a desktop PC, which does not allow any statements regarding efficiency. [GdGH+06] focuses on the efficiency of a system for recognizing buildings (e.g. for use as a tourist guide) with mobile devices, making use of a local invariant regions algorithm. Several approaches for object recognition using global or local features are analysed, with global features such as colour distribution proving to be insufficient and not robust to occlusion or different viewpoints. Algorithms such as SIFT that utilize local features are more robust to these problems, but are found to be inefficient when carried out on a mobile device. In order to reduce computation time, the image data is compressed using principal component analysis. The similarity of the captured image to a building in the application's database is then determined using a voting scheme. Tests were carried out on a Sony Ericsson K700i and a Nokia 6630, with both phones only supporting Java applications, and achieved recognition times of less than 5 seconds for one building. An application that is built explicitly for visually impaired users is described in


[ICS08]. The system is implemented in Symbian C++ on a Nokia N95 phone and detects zebra crossings in real time (3 frames per second), using the phone's video capturing mode. The user points the camera in the estimated direction and the application outputs an acoustic notifier if a zebra crossing is detected. The system is based on feature extraction in the first stage and figure-ground segmentation using a graphical model, the factor graph, in the second stage. Figure-ground segmentation[5] describes the process of grouping pixels into object (figure) and background (ground) pixels depending on their compatibility as a group of figure or ground pixels respectively. Since mobile devices do not have floating-point units (FPUs), all floating-point operations are carried out on a software-emulated FPU, which has a great impact on the processing speed. In order to avoid floating-point calculations, the phone implementation uses a simplified version of factor graph belief propagation to perform statistical inference on the factor graph, as well as static arrays instead of dynamic lists. This application is the only known approach to date that aims at processing images on a mobile platform in real time and is therefore particularly interesting for our project in terms of efficiency. The Symbian platform is also used to develop mobile colour recognition software, as described in [KT07]. Due to it being the native language of the Symbian operating system and providing very low-level access to devices and other services, the C++ programming language is considered suitable for this task. The system was tested on a Nokia N93 smartphone running the Symbian S60 3rd edition operating system, and yielded a minimal processing time of 4.4 seconds after reducing the sample rate of the test image. Since this system is only colour-based, it strongly depends on the lighting conditions and camera parameters, and is therefore very likely to produce incorrect results. Another very recent development (summer 2009) is the use of mobile phones for augmented reality applications. These systems make use of a smartphone camera to capture real-time images, process the images and output information based on them. The Swedish company TAT [TAT09] announced their Augmented ID system that matches people's faces with their profiles in the database of a social network, using a 3D facial recognition method. It then displays personal information (such as Facebook or Twitter profiles) as hovering icons around the person's face.
[5] A term originating in early 20th-century Gestalt psychology, dealing with human visual perception. The theory makes statements about how the visual system groups individual elements into objects, based on cues such as proximity and similarity.
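Since the lack of a hardware FPU noted above recurs throughout this review, a short illustration of the standard workaround may be useful. The sketch below shows a common fixed-point representation in which all runtime arithmetic stays in plain integers; it is our illustration of the general technique, not code taken from [ICS08].

    // Illustrative Q8 fixed-point arithmetic: a value v is stored as the
    // integer v * 256, so all runtime operations are integer-only.
    typedef int TFixed;
    const int KFracBits = 8;

    inline TFixed FixedFromInt(int i) { return i << KFracBits; }
    inline TFixed FixedMul(TFixed a, TFixed b)
    {
        return (TFixed)(((long long)a * b) >> KFracBits);
    }
    inline TFixed FixedDiv(TFixed a, TFixed b)
    {
        return (TFixed)(((long long)a << KFracBits) / b);
    }
    // Constants such as log-likelihood ratios can be computed offline on a
    // PC and embedded as integer literals, so no floating-point code has to
    // run on the device itself.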


2.6 Analysis of Methods for Object Detection

In order to decide which method for object detection on a mobile platform seems suitable for our task, we have to look into the details of the approaches that were proposed in the previous section. Due to our lack of prior knowledge of image processing on platforms with restricted computing power, the decision will be based on the findings and conclusions drawn in previous research. While the SIFT algorithm (scale-invariant feature transform) is considered superior due to its invariance to image transformations (scaling, translation, rotation) [RR06], it is also labelled too inefficient on mobile platforms [FZB+05]. By using the modified i-SIFT (informative SIFT), the application's runtime can be reduced, while still yielding high recognition rates [GdGH+06]. However, the computationally demanding execution of this algorithm is still too time-consuming for mobile devices, which is why it can be generally ruled out as unsuitable for the given task. Another approach mentioned in similar applications is the use of a boosting algorithm such as AdaBoost (adaptive boosting). Boosting, which evolved from the domain of machine learning, is based on the combination of weak (i.e. only slightly better than random guessing) learning algorithms in order to produce one strong learning algorithm through training [FS99]. During the training, AdaBoost performs the weak algorithm repeatedly (e.g. for 100 rounds) on a set of input values that are initially weighted equally. If a value is incorrectly classified, its weight is increased, which grades it as a "hard" example that the algorithm has to concentrate on. This training leads to a weak hypothesis for every round; these are combined into the final hypothesis, which yields a very low error rate. [PTAE09] describes the use of Haar-like features as weak classifiers for general object detection. Haar-like features are image features that are represented as adjoining black and white rectangles, the value of each feature being the difference of the pixel grey level values within the rectangles. By using this method for classification instead of single pixel values, the classification process can be sped up significantly. The advantage of AdaBoost is that the initial training can be carried out on an external device, which makes it independent of the mobile phone's processing power. Implementations of different versions of this algorithm (namely AdaBoost.M1, and a more complex version, AdaBoost.M2) in various programming languages are available online and will be analysed for their portability onto the Symbian platform.
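To make the speed argument for Haar-like features concrete: once an integral image has been computed in a single pass, the pixel sum of any rectangle costs four table lookups, so each feature is evaluated in constant time regardless of its size. The sketch below illustrates this; it is plain illustrative C++ under assumed names, not part of the AdaBoost implementations mentioned above.

    #include <cstdint>
    #include <vector>

    // Illustrative sketch: integral image with an extra zero row/column, so
    // iS[y][x] holds the sum of all pixels above and to the left of (x, y).
    struct TIntegralImage
    {
        int iW, iH;                  // table dimensions (image size + 1)
        std::vector<long> iS;
        TIntegralImage(const uint8_t* img, int w, int h)
            : iW(w + 1), iH(h + 1), iS(iW * iH, 0)
        {
            for (int y = 1; y < iH; ++y)
                for (int x = 1; x < iW; ++x)
                    iS[y * iW + x] = img[(y - 1) * w + (x - 1)]
                        + iS[y * iW + x - 1] + iS[(y - 1) * iW + x]
                        - iS[(y - 1) * iW + x - 1];
        }
        // Sum of the w x h rectangle whose top-left corner is (x, y):
        // four lookups, independent of the rectangle size.
        long Sum(int x, int y, int w, int h) const
        {
            return iS[(y + h) * iW + (x + w)] - iS[y * iW + (x + w)]
                 - iS[(y + h) * iW + x] + iS[y * iW + x];
        }
    };

    // A two-rectangle Haar-like feature: left half minus right half.
    long HaarTwoRect(const TIntegralImage& ii, int x, int y, int w, int h)
    {
        return ii.Sum(x, y, w, h) - ii.Sum(x + w, y, w, h);
    }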


However, as there are no test results for the efficiency of AdaBoost implementations on mobile platforms (see 2.5), the suitability of this approach for our project has yet to be determined. The most promising method that achieved high performance rates on a mobile platform without the use of boosting is described in [ICS08]. The algorithm utilizes (max-product) factor graph belief propagation, a method drawn from the area of machine learning, for figure-ground segmentation in order to infer the state (figure or ground) of each segment extracted from the image. The belief propagation algorithm has only low complexity, as "the required time grows only linearly with the number of nodes in the [graph]" [YFW03], a clear advantage for the implementation on a mobile device with a weak processing capacity. Due to its convincing performance and the small set of training images necessary for recognising objects, we decided to implement the system based on a factor graph belief propagation method. In particular, the simplified max-product version of this algorithm, as proposed in [SC07], is expected to allow an efficient implementation. The next section will give a more detailed explanation of factor graph belief propagation with respect to the task of image processing. This is completed by a description of the steps necessary for implementing the algorithm, which will be outlined in section 4.4.

2.7 Factor Graph Belief Propagation

2.7.1 Factor Graphs

Factor graphs are graphical models for the factorisation of global functions as a product of local functions, which represent the mathematical relation "is an argument of" between variables and the local functions. Factorisation of a function is the process of decomposing a global function g(x_1, \ldots, x_n) into smaller parts, its factors, that have a subset of \{x_1, \ldots, x_n\} as their arguments. The product of these factors (or local functions) then again forms the original function. Generally, the factorisation of a function g is defined as in [KFL01, p. 499]:

    g(x_1, x_2, \ldots, x_n) = \prod_{j \in J} f_j(X_j)    (2.1)

This process can be visualised by a bipartite[6] graph that consists of:

- Variable nodes x_i, the set of all variable nodes being X = \{x_1, \ldots, x_n\}.
- Factor nodes f_j, representing local functions that determine probabilities.
- Undirected edges that represent the relationship f_j(X_j). An edge between variable node x_i and factor node f_j exists if x_i is an argument of f_j, i.e. if x_i is contained in the subset X_j of \{x_1, \ldots, x_n\}.

Figure 2.2 (based on [SC07, p. 4]) shows an example of a factor graph with four variable nodes w, x, y, z and three factors f, g, h. The factor graph in this figure shows the joint distribution P(w, x, y, z) = f(w, x, y) g(x, y, z) h(y, z), which is represented by the edges between the variable nodes and factor nodes.

Figure 2.2: Example of a factor graph with variable nodes w, x, y, z (circles), and factor nodes f, g, h (squares).

With respect to the task of image processing, the nodes in a factor graph correspond to segments of the image, which have to be classified into figure (segments that fulfil the criteria for being part of the object) or ground (i.e. not part of the object: background). This makes the variable nodes in the graph binary, as they can have one of two states assigned: x_i = 1 (figure) or x_i = 0 (ground). The decision (or evidence) whether a segment is more likely to belong to figure or ground is based on cues that describe the relationship between neighbouring segments. These cues can be of any arity (unary, binary, ternary and so on) to take into account any number of segments. The evidence in turn is based on the statistical differences between figure segments and ground segments, which are learned from empirical data. Using the evidence from all cues, a factor graph
[6] Bipartite describes the fact that the nodes can be divided into two sets, with edges only running between nodes from different sets. Here, the two sets are variables and factors.


then represents the joint distribution of each node's state based on this evidence [ICS08]. Based on a description given in the aforementioned source, the relationship between variable nodes and n-ary cues in a factor graph will now be explained in detail. The objective of this section is to clarify how we can infer the assignment of each segment x_i, that is, how to determine the global state assignment (configuration) X = \{x_1, \ldots, x_n\} of all segments extracted from the image. Based on training data, it can be estimated that a certain number of segments is likely to be in figure state, independently from other segments; this is an a priori belief, i.i.d.[7] on X, which is defined as:

    P(X) = \prod_{i=1}^{n} P_i(x_i)    (2.2)

In detail, we know that P_i(x_i = 0) = p_0 and P_i(x_i = 1) = 1 - p_0. This means that, without considering any relationships between two or more segments, it is already known that each segment has the likelihood p_0 of being in state 0, and 1 - p_0 of being in state 1 respectively. The probability p_0 (ranging from 0 to 1) is determined through training data. A binary cue C_{ij} describes the relationship between two neighbouring segments i and j. The relationship between this binary cue and the states of the two segments it relates is defined as the conditional distribution P(C_{ij} | x_i, x_j). Again, this distribution is learned from training data. It can be decomposed into two distributions P_{on} and P_{off}, which describe the likelihood of the segments belonging to figure (on) or ground (off):

    P_{on} = P(C_{ij} | x_i x_j = 1)    (2.3)

is the distribution of the cue for both segments in figure state (x_i x_j = 1), and accordingly

    P_{off} = P(C_{ij} | x_i x_j = 0)    (2.4)

is the distribution if the product x_i x_j = 0, i.e. at least one of the segments is 0. Evidence as to whether the pair of segments belongs to figure or ground is then given by the difference between the two distributions as \log(P_{on}(C_{ij}) / P_{off}(C_{ij})).
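As a worked micro-example with invented numbers (not taken from the actual training data): suppose the learned distributions assign an observed cue value C_{ij} the likelihoods P_{on}(C_{ij}) = 0.8 and P_{off}(C_{ij}) = 0.2. The evidence contributed by this cue is then

    \log \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})} = \log \frac{0.8}{0.2} = \log 4 \approx 1.39 > 0

so this pair of segments is pulled towards the figure state; a negative value would pull it towards ground, and a value near zero would be uninformative.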
[7] Independent and identically distributed. The states of all variables are independent from those of other variables, and each variable has the same probability distribution.


Generally, the set C of all cues C_{ij} can then be related to the set of variable nodes X through the posterior distribution[8] P(X | C), which is proportional to the product of the aforementioned a priori belief from equation 2.2 and the distributions for the binary cues:

    P(X | C) \propto P(X) \prod_{(ij)} P(C_{ij} | x_i, x_j)    (2.5)

Using the two equations 2.3 and 2.4, the product over (ij) can be rewritten as:

    \prod_{(ij)} P(C_{ij} | x_i, x_j) = \prod_{(ij)} P_{off}(C_{ij}) \prod_{i,j : x_i x_j = 1} \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})}    (2.6)

The product over i,j : x_i x_j = 1 is restricted to x_i x_j = 1, which means that only pairs of segments that are both in figure state are taken into account. In this equation the product over P_{off} is independent of X, which is why it can be removed from the posterior probability when combining equations 2.5 and 2.6:

    R(X | C) = P(X) \prod_{i,j : x_i x_j = 1} \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})}    (2.7)

which is equivalent to

    \log R(X | C) = \sum_i \log P_i(x_i) + \sum_{ij} x_i x_j \log \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})}    (2.8)
Maximizing this expression leads to an estimate for the maximum a posteriori[9], or MAP. By using belief propagation on the factor graph, this MAP can be determined in an efficient way, which will be described in the ensuing section. Since the method uses more than just one cue, we have to add one term for each cue to the previous equation. For binary cues, this means that for each additional cue a term of the form \sum_{ij} x_i x_j \log \frac{P_{on}(C_{ij})}{P_{off}(C_{ij})} is added to equation 2.8, i.e. the distributions for all cues are multiplied in order to determine the most likely global assignment of all variables in X. After defining how the distributions for each cue are computed, we will describe the process of constructing the factors for
[8] The empirically determined probability, which summarizes "the current state of knowledge about all the uncertain quantities" [Gel02].
[9] The particular value of X that maximizes the posterior.


each cue, which will be used in the factor graph. This process is carried out step by step, beginning with binary cues: for each pair of variable nodes (neighbouring segments in the image) we determine whether they satisfy the cue and mark them as candidate factors, which will then be used to determine the candidate factors for 3-ary cues, which are in turn used to determine the factors for the arity-4 cues.
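To make this construction concrete, the following is a minimal sketch of the data structures the description implies: binary variable nodes for the image segments, and factor nodes of arity one to four carrying the learned evidence for their cue. The names and layout are illustrative assumptions made for this sketch, not the CMSP classes described in chapter 3.

    #include <vector>

    // Illustrative factor-graph representation for figure-ground segmentation.
    struct TVariable
    {
        std::vector<int> iFactors;  // indices of the factors (cues) that
                                    // this segment is an argument of
        int iState;                 // 0 = ground, 1 = figure (to be inferred)
    };

    struct TFactor
    {
        std::vector<int> iVars;     // the 1-4 segments this cue relates
        int iLogRatio;              // fixed-point log(P_on / P_off) for the
                                    // cue, learned offline from training data
    };

    struct TFactorGraph
    {
        std::vector<TVariable> iVariables;
        std::vector<TFactor> iFactors;
    };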

2.7.2 Belief Propagation on Factor Graphs

After this detailed explanation of how a factor graph is constructed, we will now illustrate how belief propagation is used to infer the likelihood of a node in the factor graph being in a certain state. Belief propagation is a version of the sum-product algorithm, an algorithm used for message passing on factor graphs, that calculates the marginal probability (belief) for each node. Two types of messages are used in factor graph belief propagation: messages sent from variables to factors, and those sent from factor nodes to variables, with both types of messages being functions of the variable that is associated with the edge along which the message is passed. We can explain the basic principle of message passing through the sum-product algorithm with the following two equations. The messages sent from variable nodes to factors are given by

    m_{x \to f}(x) = \prod_{h \in n(x) \setminus \{f\}} m_{h \to x}(x)    (2.9)

Here, n(x) is the set of all factor neighbours of x in the graph. Equation 2.9 expresses that the message sent by a variable node is the product of all messages it has received from other factor nodes h, i.e. the variable node simply forwards the messages[10]. The factors here correspond to the local functions defined for the factor graph, i.e. the probabilities P for every cue based on its parameters. The messages sent by factor nodes are defined by the product of the factor itself with all messages sent from the variable nodes it is connected to, which is then summed out:

    m_{f \to x}(x) = \sum_{\sim\{x\}} f(X) \prod_{y \in n(f) \setminus \{x\}} m_{y \to f}(y)    (2.10)

[10] The general approach to describing this process is to treat the graph as a tree and define the message as the product of all messages received from child nodes. When implementing the algorithm, all nodes are treated as child and parent nodes of the nodes they are connected to.


X = n(f) is the set of all arguments of f, and \sim\{x\} denotes the sum over all variables except x. The messages are updated until they converge; then the marginal distribution for a node x is computed as the product of all messages that are sent to x. In order to allow an efficient implementation on a platform with low computational power, the max-product version of belief propagation is used to estimate the maximum a posteriori. By implementing this version in the log domain (taking the logarithm of all equations), where all calculations are reduced to addition and subtraction, efficient computation of the belief is made possible. The message updating equations in this max-product version are defined by the following two equations (note the sum and maximum here instead of the product and sum in 2.9 and 2.10):

    m_{x \to f}(x) = \sum_{h \in n(x) \setminus \{f\}} m_{h \to x}(x)    (2.11)

    m_{f \to x}(x) = \max_{\sim\{x\}} \left( f(X) + \sum_{y \in n(f) \setminus \{x\}} m_{y \to f}(y) \right)    (2.12)

Eventually, the belief function for each node in the graph is calculated as:

    b(x) = \sum_{f \in n(x)} m_{f \to x}(x)    (2.13)

In this framework, each factor f(x_1, \ldots, x_m) in the factor graph is only non-zero if all of its parameters \{x_1, \ldots, x_m\} are 1. As suggested by [SC07], a non-negativity requirement is introduced in order to reduce the computational complexity of the method: K_f = f(x_1 = 1, \ldots, x_m = 1) - f(x_1 = 0, \ldots, x_m = 0) \geq 0, that is, all factors have to be greater than or equal to zero. Adding this to equation 2.13, the belief for each node is then computed as b_x(x = 1) = \sum_{f \in n(x)} K_f and b_x(x = 0) = 0, which then leads to the final equation for the beliefs of all nodes:

    B_x = \sum_{f \in n(x)} K_f    (2.14)

Finally, with respect to the implementation of this algorithm, the notion of scheduled message passing has to be explained. It is assumed that the sending and receiving of messages is organised by a schedule (such as a timer) that specifies the


way messages are passed. This schedule can be synchronous (a flooding schedule), which means that all messages are updated at the same time, or asynchronous (serial), where only one message is updated at a time. Usually, several runs (sweeps) of the non-simplified message passing algorithm have to be performed in order for it to converge when all messages have been sent.
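To illustrate why the simplification leading to equation 2.14 is attractive on a weak processor: once the K_f values are known, the belief of every node is obtained by a single accumulation pass over the factors, with no floating-point arithmetic and no iterated sweeps. The sketch below shows only this final accumulation step, under assumed data structures (redefined here so the snippet stands alone); it is our illustration, not the thesis's implementation.

    #include <vector>

    struct TFactor { std::vector<int> iVars; int iK; };  // iK = K_f, fixed point

    // Equation 2.14: B_x is the sum of K_f over all factors f adjacent to x.
    // The belief for state 0 is fixed at zero, so a segment is assigned to
    // "figure" when its accumulated belief is positive.
    std::vector<int> ComputeBeliefs(int aNumSegments,
                                    const std::vector<TFactor>& aFactors)
    {
        std::vector<int> beliefs(aNumSegments, 0);
        for (const TFactor& f : aFactors)
            for (int v : f.iVars)
                beliefs[v] += f.iK;
        return beliefs;
    }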

2.8 Chapter Summary

This chapter discussed the background and foundations of the given task of image processing on a mobile platform and described the notion of assistive technology for visually impaired people. Different types of smartphones were analysed, which led to the decision to develop the recognition system for Symbian OS, using its native programming language Symbian C++. As shown in this chapter, there exists a wide range of applications that deal with image processing on mobile phones, with different strategies for both the application structure (server-client, stand-alone) and the algorithms used for the processing task. Having explained factor graph belief propagation, we will now take a closer look at the application design and discuss how these methods will be integrated into the recognition system.

Chapter 3 Application Design


3.1 Overview

In this chapter, we will outline and discuss the preliminary considerations that have to be made before implementing the application. In the first section, the functional and non-functional requirements for the recognition system will be listed, which is then followed by a detailed description of the software architecture. This includes an explanation of the different parts a Symbian OS application comprises, as well as an overview of the program's structural and behavioural organisation using UML diagrams. Finally, we will highlight details of the software, such as the algorithms that will be used, and give a short description of the images necessary for training the system. The main objective of the chapter is to provide the reader with a clear idea of all the tasks that will be carried out, from a high-level perspective. Detailed descriptions of the actual implementation will then be discussed in the ensuing chapter.

3.2 Requirements Analysis

In order to describe the necessary functionalities of the application and to assist the evaluation process, the requirements that have to be met by the software and its user interface will be defined in this section. They are organised into two groups: the first part lists functional requirements, which describe the behaviour of the system, i.e. what the application does. The second group are non-functional requirements, which describe how these tasks will be performed by the application.


3.2.1 Functional Requirements

- The program detects BS 5499-4 emergency exit signs that are not lit up internally (must-have)
- A detected object is indicated by a sound (must-have)
- Interactions with the software are confirmed to the user through text output (must-have)
- The capturing process begins automatically when starting up the software (should-have)
- Capturing the image and repeating the process takes one click and is repeated automatically if no object is detected, while the user is panning the phone (should-have[1])
- If present, the direction of the arrow (left, right, up, down) on the sign is output as text (could-have)
- If present, any text on the sign (such as "Fire Exit" or "Exit") is read out by the system (could-have)
- The software outputs information on the distance of the sign from the user, based on the camera lens specifications and the size of the detected sign (could-have)

3.2.2 Non-Functional Requirements

- The execution time for one image lies in a time frame that is acceptable for the user, e.g. less than 2 seconds (must-have)
- The software works in various lighting conditions, including outside a well-lit environment (must-have)
- The application works correctly and does not show any unexpected behaviour or lead to system errors, such as program crashes (must-have)
[1] Making the application as comfortable to use as possible is clearly an important objective, which would make this item a definite must. However, if the automated capturing proves to be too computationally demanding, we can abandon this feature without compromising the initial idea behind the project.


- The interface is accessible by screen reader software, i.e. all menu items and outputs can be read out to the user (must-have)
- The software does not have any complicated menus or graphical elements (must-have)
- Starting the recognition process must require as few steps as possible, ideally beginning automatically on program start-up (should-have)
- The software is able to perform the recognition process automatically in real time, i.e. at several frames per second (could-have)

3.3 Software Architecture

In this section we will outline the basic program structure using descriptions in both written form and UML diagrams. The software will be organised into several modules that deal with the different program tasks. In more detail, the basic structure of any Symbian v9.2 application comprises five classes2 that are necessary to start up the program and draw the screen:

Main: The first object called by the OS when starting up an application; creates a new application object and runs it.

Application: Creates a new document object and returns a pointer to it.

Document: Creates the application user interface (AppUi) object.

AppUi: The application user interface handles all interactions such as pushing a key or selecting a menu item. It creates an AppView object (or multiple views) which is used for screen access. Here, it also creates a new MainController object which coordinates the image capturing and processing.

AppView: The application view draws the screen to make information visible to the user.

To this skeleton, we add more classes for the image processing task:

2 The naming conventions for Symbian demand that all class names start with Cxx, xx being the application name. In this case, all classes have the prefix CMSP.


MainController: Coordinates all image capturing and processing operations and returns the results to be displayed and read out by the UI, which forwards them to the AppView object.

ImgCaptureEngine: Fetches an image by accessing Symbian's camera API and returns it to the controller.

ImgProcessor: Takes the image provided by the previous module and performs pre-processing operations on it, then determines the presence of a sign and returns the results (sign present: yes/no, arrow direction) to the controller.

CFactorGraph: Constructs the factor graph object based on the image segments and the cues defined in the next section.

CFactor: Class for factor nodes in the graph.

CBeliefPropagation: Performs belief propagation on the factor graph.

The structure shows that the modules can be designed to interact over clearly defined interfaces, which is crucial for the development process, as it will simplify separate implementation of the individual parts and their composition at a later stage. This will also help to optimise the performance of each module and to carry out precise testing. The complete organisation of the image processing system, i.e. the Symbian skeleton and the classes created for the processing task, is shown in figure 3.1. It has to be noted that, due to the complexity of the classes, only the most important member data and methods are displayed in the class diagram.

The ImgCaptureEngine class in particular shows an important feature of Symbian OS applications. The class is derived from CActive, which makes it an Active Object, a framework that allows for asynchronous programming. This construct comprises the Active Object in the form of a class derived from CActive, and an Active Scheduler which is provided by the Symbian application architecture. Using an Active Object makes it possible to manage asynchronous functions, which means that a function returns immediately after being called, without waiting for further tasks to be executed. This is particularly useful for the task of capturing continuous frames from a camera: the system issues a request to capture a frame from the camera, which is handled asynchronously while other tasks can be performed. Once the frame has been captured, the camera object issues a callback to its observer, which then initiates processing of the image.
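To make this pattern concrete, the following is a minimal sketch of such an Active Object skeleton, assuming the standard CActive semantics described above; the class body and method contents are illustrative, not taken from the actual source.

    // Minimal Active Object skeleton (illustrative; standard CActive usage).
    class CMSPImgCaptureEngine : public CActive
        {
    public:
        CMSPImgCaptureEngine() : CActive(EPriorityStandard)
            {
            CActiveScheduler::Add(this);   // register with the Active Scheduler
            }
        ~CMSPImgCaptureEngine() { Cancel(); }

        void RequestFrame()
            {
            // issue the asynchronous capture request here, then flag
            // that a request is outstanding
            SetActive();
            }

    private:
        void RunL()
            {
            // called by the Active Scheduler once the request completes:
            // hand the captured frame over to the MainController
            }
        void DoCancel()
            {
            // cancel the outstanding capture request
            }
        };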


Figure 3.1: Class diagram showing the organisation of the application classes

Figure 3.2 shows a state diagram of the basic processes that are executed when running the software, which gives a more detailed insight into the coordination of the system's components and functionalities. The application can be closed from every state by using the Exit menu option, or simply by pressing the hang-up key on the phone.


Figure 3.2: State diagram for the emergency exit sign recognition software

3.4 Image Processing Methods and Algorithms

An obvious approach to the task of recognising emergency exit signs would be to base the object detection solely on the image's colour values. All escape route signs according to BS 5499-4 show white icons on a plain green background, which could make it easy to search for a rectangular green sign in the image. [FZB+05] achieved the highest and most efficient recognition rates by analysing the objects' colour. However, due to the varying (and previously unknown) lighting conditions and different phone camera properties, this method is not expected to yield adequate results for our project. In addition to other methods, analysing colour features could improve the recognition rate, which will be examined during the course of the project.

3.4.1 Edge Detection

The first phase of the processing will consist of different pre-processing tasks, such as converting a colour image into greyscale, using an edge detection method to extract edges from the image, and thresholding the results to produce a binary edge map. An edge map provides information about the estimated location of edges (i.e. region boundaries or contours) in an image, which are defined as changes in the image intensity. A drastic change in the intensity indicates a clear edge (for example, the border of a black object on a white background), whereas a more gradual change hints at a more blurred or softer edge. We decided to use the Sobel filter for edge extraction, which can be easily and efficiently implemented using integer operations (multiplication and addition). Other edge detection methods such as the Laplacian were considered for this task, but deemed unsuitable due to their high sensitivity to noise. The Sobel filter uses the two 3x3 kernels shown in figure 3.3 for computing approximations of the horizontal and vertical derivatives at each pixel.

d_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \qquad d_y = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}

Figure 3.3: Sobel kernels used for horizontal and vertical derivatives

In order to reduce the edges that are extracted by the Sobel operator to thin, one-pixel lines, non-maximum3 suppression is performed on the image. The idea of this operation is to reduce the visible pixels to the ones that are local maxima in their neighbourhood, which is given by their gradient direction (the normal to the edge direction). Only if the intensity of a pixel is greater than the intensities of its neighbouring pixels along the gradient direction can it be considered a local maximum. This thinning operation is an important step in an edge detector in order to ensure locality of the edge, which means that the edge is detected exactly at its location.

3 The term is used in different variations such as non-maximal and non-maxima suppression.

3.4.2 Extracting Straight Line Segments

The next step consists of detecting any rectangular object in the image that has the approximate dimensions of an emergency exit sign, regardless of the actual content. The Hough transform was considered for this process, given its suitability for finding straight lines in an image. However, it was not possible to find a simplified or approximated version of the algorithm that could be implemented efficiently without floating point operations; this would cause a slowdown of the processing speed, which is clearly not desirable for the project. First, we need to extract straight line segments, which are then used as a starting point for detecting a rectangular structure in the image. This is achieved using a greedy bottom-up grouping procedure as suggested in [SC07]. The image is checked for vertical and horizontal edges separately. For the detection of horizontal lines, the method groups edge pixels that are already connected and form an approximately horizontal line into smaller segments. Small gaps between these segments are then filled if the segments are neighbours (within a certain region) and have roughly the same orientation. Segments that are shorter than 20 pixels are then removed from the set, which eventually contains all horizontal straight line segments, represented by their start and end points.

3.4.3 Detecting Rectangular Shapes

Suggestion for a Simplified Method

Once the straight line segments have been detected, the system needs to determine whether there is a rectangle (i.e. a quadrilateral shape that has roughly parallel opposing sides) present in the frame. It can be argued that a straightforward way of carrying out this task would be to simply check for overlapping (or nearly overlapping) start and end points of horizontal and vertical segments. Take a horizontal segment A with start point SA = (xsA, ysA) and end point EA = (xeA, yeA). The following conditions have to be satisfied for the shape to qualify as a candidate rectangle (a sketch of the corner test follows the list):

- The coordinates of SA are within the neighbourhood (minimal difference in x- and y-direction) of the start point SB of a vertical segment B with a length shorter than A
- The coordinates of EA are within the neighbourhood of the start point SD of a vertical segment D shorter than A and roughly the same length as B
- B and D have opposite polarity, i.e. the gradient direction of B is the inverse of D's gradient direction
- B's orientation is orthogonal to A's (roughly orthogonal, that is, within only a few degrees)
- D's orientation is (roughly) orthogonal to A's
- B and D are roughly parallel, i.e. the difference between their gradient orientations is minimal
- The coordinates of EB are within the neighbourhood of the start point SC of a horizontal segment C
- The length of this segment C is similar to the length of A
- The coordinates of ED are within the neighbourhood of the end point EC of the same horizontal segment C
- C's orientation is roughly orthogonal to B's and D's, and roughly parallel to A's
- A and C have opposite polarity, i.e. the gradient direction of A is the inverse of C's gradient direction

Please note that, in order to simplify the construction, start and end points of segments are classified in a left-to-right (for horizontal segments) and top-to-bottom (for vertical segments) manner respectively, regardless of their polarity.
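As a sketch of the corner test at the heart of these conditions (the segment structure, the tolerance parameter and the helper names are hypothetical; the orientation and polarity conditions would be checked analogously):

    // Corner-proximity test for a candidate rectangle (illustrative).
    // A: top horizontal, B: left vertical, C: bottom horizontal, D: right vertical.
    struct TSegment
        {
        TPoint iStart;   // left-most (horizontal) or top-most (vertical) point
        TPoint iEnd;
        };

    static TBool Near(const TPoint& aP, const TPoint& aQ, TInt aTol)
        {
        return Abs(aP.iX - aQ.iX) <= aTol && Abs(aP.iY - aQ.iY) <= aTol;
        }

    static TBool IsCandidateRectangle(const TSegment& aA, const TSegment& aB,
                                      const TSegment& aC, const TSegment& aD,
                                      TInt aTol)
        {
        return Near(aA.iStart, aB.iStart, aTol)   // top-left corner
            && Near(aA.iEnd,   aD.iStart, aTol)   // top-right corner
            && Near(aB.iEnd,   aC.iStart, aTol)   // bottom-left corner
            && Near(aD.iEnd,   aC.iEnd,   aTol);  // bottom-right corner
        }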

With respect to perspective bias, which affects the length of the segments (which, ideally, would be pairwise of equal length), it can be assumed that exit signs are approximately at eye level (or slightly above), which means that the perspective distortion is expected to be minimal. As for the length of the segments: on average, an emergency exit sign that contains all three icons (the word Exit, a running person and an arrow) has a width to height ratio of 2.8, i.e. the horizontal segments are nearly three times as long as the vertical ones. Assuming that the camera is close enough for the sign to fill out the full image width of 320 pixels (which is very unlikely), this means that the vertical segments are at most 118 pixels long. It also needs to be mentioned that some of these conditions can be omitted, as they are the consequence of other conditions. For example, if two vertical segments of the same length begin at the start and end points of a horizontal segment, then the horizontal segment at the bottom of the sign that connects their end points must be approximately the same length as the top segment (again, assuming that the perspective bias is minimal).

Factor Graph Belief Propagation

While the simplified method described in the previous paragraph seems easy to implement, it is the objective of this study to review more sophisticated approaches to the task of object recognition that are based on inference rather than basic image processing on a pixel level. This is why we will now describe a solution based on the factor graph belief propagation method as explained in the previous chapter. This will allow us to analyse the segment groups with respect to multiple cues and perform rapid inference on them in order to determine whether a segment is part of a rectangle or not. Using the straight line segments extracted in the edge detection step, the factor graph is constructed with each line segment being a node variable in the graph. Based on the specification of factor graphs, the cues that the factor graph uses for this task will now be described in detail. It has to be noted that, due to the characteristics of a rectangular sign, both horizontal and vertical straight line segments have to be analysed. In order to simplify this, the horizontal and vertical segments will first be checked individually, then the candidates will be combined to look for matching 4-tuples (that is, one vertical and one horizontal pair). For all cues, the distributions Pon and Poff (as explained in section 2.7.1) are determined based on training images.

Unitary cues make a statement about single segments, regardless of their relationship with other segments. We will only use one unitary cue:

- Segment length: On average, the horizontal straight line segments at the top and bottom of the sign are long compared to other straight lines in the image, whereas the vertical lines are relatively short (as previously mentioned).

The binary cues describe a relationship between two neighbouring segments (two nodes) with opposite polarity:

- Parallelism: The difference (in absolute value) between the orientations of two neighbouring segments is minimal.
- Proximity: For horizontal pairs, the distance between the two segments is usually relatively small (approximately one third of the segment length), whereas the distance between vertical segments is the inverse of this, i.e. roughly three times the length of the segments.
- Overlapping: The difference between start and end points of horizontal / vertical pairs is within a certain limit.

The arity-4 cues take into account four straight line segments, that is, one horizontal pair and one vertical pair:

- Orientation: The average orientation of the horizontal pair is orthogonal to the orientation of the vertical pair.
- Corner points: The differences between the coordinates of start and end points of the four segments are minimal.
- Width to height ratio: As previously mentioned, the ratio of horizontal segment length to vertical segment length should be in the region of 2.8.

After defining the cues that will be used to describe the relationships between line segments, the factor graph has to be constructed. Each cue corresponds to a factor (a local function) of a global function that describes the likelihood of a segment being part of a rectangular shape. The first two steps are carried out individually for horizontal and vertical lines, with candidate factors being generated for every segment or segment pair that meets the requirements. Beginning with the unitary cue, only segments of sufficient length are considered candidates for the next step, and a unitary factor is constructed. Determined by the binary cues for parallelism, proximity and overlapping, all pairs of segments (with opposite polarity) that satisfy the criteria for those cues are selected as candidates.

This is followed by combining vertical and horizontal candidate pairs to check them against the arity-4 cues. The final decision whether a node variable is in figure state (xi = 1) is then based only on its belief Bx. If Bx is sufficiently large, the node will be assigned figure state; if not, it will be set to ground (xi = 0). The simplified version explained in the previous chapter suggests that no message updates are necessary for an approximated result, which means that the belief propagation will already converge after one run.

3.4.4 Analysis of the Sign's Content

If the decision is made that a rectangle is indeed present in the image, a sound is output in order to notify the user of this status. An image is then captured at a higher resolution and several pre-processing tasks are carried out to prepare the final analysis of the sign's characteristics. As we know the coordinates of the start and end points of the four rectangle sides, it is possible to enlarge the section containing the rectangle. In order to calculate the coordinates for the larger image (640x480 pixels), the coordinates obtained from the smaller image (320x240 pixels) simply have to be doubled. Assuming that the rectangle is not immediately surrounded by any clutter, that the perspective distortion and rotation are minimal, and that the background colour is different from the green sign (which is necessary to achieve a high contrast so that the signs can be clearly seen inside buildings), simply clipping the image to its bounding box with sides parallel to the image borders would be sufficient to isolate the sign from its surroundings. However, in order to achieve an accurate result for the ratio of sign background and icon pixels, the image needs to be freed from any perspective bias, rotation and background clutter. This can be achieved by projecting the presumably skewed and rotated quadrilateral shape onto a rectangle with parallel sides. This projective transformation then maps the points from the distorted image to the corresponding points in the rectangle. Using the four corner points of the distorted image and the target rectangle as reference points, a matrix for the transformation can be constructed as described

in [Blo]. In this case, the projective transformation is defined as

\[
\begin{pmatrix} u & v & w \end{pmatrix} =
\begin{pmatrix} x & y & 1 \end{pmatrix}
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & 1 \end{pmatrix}
\tag{3.1}
\]

Here, the matrix A is a non-singular (invertible) homogeneous transformation matrix with eight degrees of freedom. The coordinates (x', y') of the mapped point are given by

\[
x' = \frac{u}{w} = \frac{a_{11}x + a_{21}y + a_{31}}{a_{13}x + a_{23}y + 1},
\qquad
y' = \frac{v}{w} = \frac{a_{12}x + a_{22}y + a_{32}}{a_{13}x + a_{23}y + 1}
\tag{3.2}
\]

This leads to a linear system with eight unknown coefficients a11 . . . a32 that is solved using the point pair coordinates in order to determine the transformation matrix.
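Multiplying equations (3.2) by their common denominator makes the construction of this system explicit: each of the four corner correspondences (x_i, y_i) -> (x'_i, y'_i) contributes the two linear constraints

\[
a_{11}x_i + a_{21}y_i + a_{31} - a_{13}x_i x_i' - a_{23}y_i x_i' = x_i',
\qquad
a_{12}x_i + a_{22}y_i + a_{32} - a_{13}x_i y_i' - a_{23}y_i y_i' = y_i'
\]

so that the four point pairs yield eight equations for the eight unknowns.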

The next step of the detection phase is then based on the histogram of the rectangular shape that was extracted from the large image. Knowing that there are only two colours present in the image, its histogram can now be examined. If the ratio of green (i.e. dark grey in the greyscaled image) and white pixels corresponds to the usual ratio found in emergency exit signs, the rectangle is marked as a candidate for a sign and the final outcome is determined by a matching procedure. This quick check reduces the overall computational costs by discarding all rectangular shapes that are highly unlikely to be emergency exit signs. Otherwise, the more expensive template matching procedure would be performed in vain.

Due to a number of reasons, it was decided to reduce the step of confirming the presence of an emergency exit sign and recognising the direction of any arrow to a simple pixel matching method. Firstly, and most importantly, we are dealing with standardised signs that differ only in the direction of the arrow and, accordingly, the orientation of the running icon. The directions from -90° to 90° are placed on the right; the directions top left, left and bottom left are located on the left hand side of the sign (as explained in section 1.4). Secondly, the location of the sign is known, defined by its corner points. Thirdly, the image has already been freed from perspective and rotational bias in the previous step. And finally, the number of different arrow directions is reasonably low (eight: the four main directions plus the diagonals), which means that in the worst case the entire sign has to be checked only eight times4. While the operation is not expected to be the most efficient method, it is straightforward to implement and does not cause any complex overhead.

4 It has to be noted that there are no clear definitions for the actual de

3.5 Training Images

The number of training images necessary for training the system varies heavily depending on the algorithm that is used for classification. AdaBoost performs well using a large set of training images ([PTAE09] mentions up to 10,000 negative and 500 positive samples from an existing database). The number of training images necessary for factor graph belief propagation is relatively low: [ICS08] uses 25 positive and 25 negative images, which still yields high recognition rates and seems more feasible. In order to determine the distributions for the different cues, training images are captured with a mobile phone camera and then labelled manually. It is particularly important to pay attention to images that could cause false positives due to their similarity to emergency exit signs, that were taken under difficult lighting conditions, or that contain perspective bias or partly occluded signs. Examples of images taken with a mobile phone camera are shown in figure 3.4. Starting with the top left image, these pictures show some of the most common problems with camera phones. The images also point out some variations of the standardised emergency exit signs that will not be recognised by the system, such as the text Fire Exit on the sign instead of the Exit specified by BS 5499-4, which we will be using as sample templates:

- The angle between camera and exit sign is wide; lights cause reflections on the sign.
- Blurring due to fast camera movement. Here, the text on the sign is FIRE EXIT in capital letters, which is expected to cause problems when applying a template matching strategy.
- The distance between the camera and the sign is very large; the sign appears small and blurred.
- The sign is placed next to lit signs, which also causes reflections and overexposure. In this picture, the sign is also made up of two plates (arrow on the left, icon and text on the right) and will not be recognised by our method.

Figure 3.4: Four examples of emergency exit signs captured with a phone camera

3.6 Issues Affecting the System Performance

Some of the challenges that mobile image processing software for visually impaired users has to deal with are characterised in [DLQ+06]. The problems are specified for a text recognition system based on a client-server architecture (see above), but can also be applied to the issue of recognising emergency exit signs. The application has to process images that

- are blurred
- contain text that is very small (or, in this case, small exit signs)
- have low contrast
- were taken under poor lighting conditions

These issues have to be considered when designing the image processing application, as well as when producing the sets of training and test images. Since there is no way of improving the camera quality, these errors can only be mitigated by choosing image processing methods that do not rely too heavily on flawless image quality.

There are also a number of critical issues that have to be dealt with when implementing the software, regarding both the implementation process and the actual problem of object recognition. First, the existing resources of computer vision libraries on Symbian OS are small compared to other platforms such as Windows PCs. This makes it necessary to implement a large number of functionalities from scratch or port them to Symbian C++, which is an error-prone procedure that slows down the development process. This risk can be reduced by using as many existing building blocks as possible and keeping to the principles of good coding practice for Symbian C++, as defined on the Forum Nokia website5. Secondly, as previously mentioned, the low processing power of smartphones and a software-emulated floating point unit require careful memory management and choice of data types. In order to deal with this problem, floating point operations have to be avoided where possible in favour of integer operations.

3.7 Chapter Summary

This chapter outlined the main aspects of the software development process. We specified the requirements for the application and gave an overview of the program design, which is based on the typical Symbian application structure. It was decided to organise the application into several modules with different functionalities that interact over clearly defined interfaces. We then gave an overview of the methods and algorithms that will be used for the image processing module of the software and specified details of the factor graph belief propagation, along with a proposal for a simplified version of the detection stage. The chapter was concluded by a discussion of the quality and quantity of test images, as well as an overview of typical problems that will have to be dealt with in the implementation phase.

5 http://www.forum.nokia.com

Chapter 4 System Implementation


4.1 Overview

The main objective of this chapter is to give an insight into the implementation phase of the project, based on the methods discussed in the previous chapter. First, we will give an overview of the implementation tools that were used, followed by a description of the different stages of the application development. This will include explanations of the implemented algorithms, along with short code listings of the most significant program segments where deemed necessary for understanding. The chapter is concluded by an explanation of the characteristics of Symbian OS with respect to methods for optimising the application performance on this platform.

4.2 Implementation Tools

Symbian provides software development kits for the different OS versions, with S60 3rd Edition (Symbian OS v9.1) and S60 3rd Edition FP 1 (Symbian OS v9.2) being the ones supported by the largest number of devices (mainly Nokia and Samsung) [Sym09]. The SDK comes with all the necessary C++ APIs, example programs and a phone emulator (which is of no use for this project, as the camera on the handset cannot be simulated by the emulator using a built-in laptop camera). In order to assist the application development process, Symbian recommends using an IDE such as Carbide.c++. This free software is based on the Eclipse IDE and offers tools for debugging, on-device debugging and GUI construction. For this project, the software was run on a Mac OS X system using a virtual machine (VirtualBox) with Windows XP Professional as a guest OS.

Compilation of the application code (using the GCC-E compiler) produces a Symbian installation file (.sis) that can be installed on any suitable Symbian device. In order for the file to be accepted by the device, it has to undergo a signing process. This is achieved using command line tools provided by the SDK to generate a key and a certificate, which are used to sign the .sis file after compiling it. The signed application1 is then installed by either directly connecting the phone to a PC via USB and initiating the installation process from the development system, or by transferring the .sis file to the handset (e.g. sending it via Bluetooth) and then installing it. The IDE also offers a mode for on-device debugging when the device is connected to the host computer via USB, which proved to be useful for debugging purposes.

4.3 Image Capturing

The phone's camera is accessed using the camera API to capture images via the phone's viewfinder. The images are transferred directly to a bitmap without any further processing. The advantage of this method over capturing a still image is the speed of the operation. In this viewfinder mode, the N95 camera produces images with a size of 320x240 pixels in 32-bit colour mode, which are then used for carrying out further pre-processing steps. In order to capture a higher resolution picture of 640x480 pixels, the camera viewfinder needs to be stopped when the capturing button is pressed. The camera settings are then changed to a higher format, and the image is captured and displayed on the screen. Figure 3.2 shows a state machine diagram of the camera module in interaction with the other system components.

In tests with the Nokia phone, the autofocus, which is run automatically by the operating system's controls, proved good enough to produce pictures of sufficient quality with little blurring, which makes adjusting the camera focus by hand unnecessary. This can be considered a very helpful feature of the built-in camera API, given that the system is designed for blind users who will not be able to adjust any camera settings to improve the image quality.
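As a sketch of how the frames arrive, viewfinder images are delivered through the camera observer callback; the observer interface is the standard ECam one, but the exact signature may differ between SDK versions, and the controller call is illustrative.

    // Viewfinder frames are delivered asynchronously by the camera server.
    void CMSPImgCaptureEngine::ViewFinderFrameReady(CFbsBitmap& aFrame)
        {
        // aFrame is a 320x240, 32-bit bitmap in viewfinder mode on the N95;
        // it is handed to the controller without any further processing
        iController->HandleFrameL(aFrame);
        }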
1 It has to be mentioned that any self-generated key and certificate pair is only valid for a certain period of time, usually one year. After that, the .sis file is rejected by the phone and has to be signed again with a newly generated key and certificate.


4.4 Phase One: Feature Extraction

Tests were carried out with a Symbian OS computer vision library in C++, developed by Nokia (NokiaCV2), that provides implementations of various image processing tasks. Due to very slow performance (3 seconds per frame for greyscale conversion and convolution with a Sobel filter), this approach was deemed unsuitable for real-time processing. Therefore it was decided to implement all processing steps using only the bitmap interface that is offered by Symbian, which also simplifies drawing the captured images to the phone screen. While accessing individual pixels via the interface's GetPixel() method offers a convenient way of manipulating the colour values, this proved too slow for efficient implementation of complex image processing algorithms that loop over the image several times. All pixels were therefore accessed through a pointer to the bitmap's first pixel, using the bitmap interface's DataAddress() method.
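A sketch of this access pattern, assuming a 32-bit bitmap whose scanlines are contiguous (which holds for the 320x240 frames used here); LockHeap()/UnlockHeap() guard the raw pointer:

    // Direct access to the bitmap's pixel data via DataAddress().
    iBitmap->LockHeap();
    TUint32* pixels = iBitmap->DataAddress();
    const TInt width  = iBitmap->SizeInPixels().iWidth;
    const TInt height = iBitmap->SizeInPixels().iHeight;
    for (TInt y = 0; y < height; ++y)
        {
        for (TInt x = 0; x < width; ++x)
            {
            TUint32 px = pixels[y * width + x];  // one 32-bit pixel (BGR byte order)
            // ... manipulate px ...
            }
        }
    iBitmap->UnlockHeap();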

4.4.1 Greyscale Conversion

In a first pre-processing step, feature extraction was performed on the input image. The first stage included converting the bitmap delivered by the camera from colour into greyscale mode3. As previously mentioned, the input image on the test device is a 32-bit RGB + alpha channel bitmap4. The conversion function takes the input bitmap and simply draws it onto a new bitmap that was created in greyscale mode.

4.4.2 Sobel Operator

The Sobel operator for edge detection is implemented using simple integer multiplication and addition to convolve the image pixels with the horizontal and vertical kernels. The gradient magnitude is then approximated as the sum of the derivatives' absolute values, Abs(dx) + Abs(dy), rather than the hypotenuse, in order to avoid calculating the square root, which would have an impact on the system performance. In an implementation for Symbian OS, attention has to be paid to the correct usage of the integer datatypes offered by the platform (several different signed and unsigned integers with various lengths, such as TUint8 for unsigned 8-bit integers), and to their explicit conversion when assigning values. Even without any prior (possibly time-consuming) smoothing, this operation produced results that were suitable for further processing. Listing A.1 shows how the Sobel operator is applied to the image, with subsequent normalisation of the resulting gradient values to the range 0..255.

2 http://research.nokia.com/research/projects/nokiacv/
3 Due to the mode being named EGray256, the US English spelling was used throughout the source code for consistency reasons.
4 In fact, the colour mode delivered by the N95 is EColor16MU, which is built up as BGR.

4.4.3 Non-Maximum Suppression

In the system's implementation, the non-maximum suppression is performed by determining the gradient direction of each pixel and comparing it to the two neighbouring pixels in the positive and negative edge direction (normal to the gradient direction). The gradient direction is defined as θ = arctan(dy/dx); however, this expensive operation is not suitable for an efficient implementation, as it already slowed down the performance to 1 frame per second. Therefore, and since we operate in a discrete domain, the gradient orientation is classified into one of the eight main directions, which we hard-code. There are several ways of carrying out this classification without directly computing the gradient orientation. One method is based on the signs and relative magnitudes of the horizontal and vertical derivatives, which classifies the pixel into one of the directions 1 to 8: direction 1 covers 0° to 45°, 2 ranges from 45° to 90°, and so on. The gradient magnitude is then compared to the linearly interpolated gradient values of the closest pixel pairs (in negative and positive gradient direction) in the discrete grid. A suitable threshold was determined through testing, with results varying depending on the light conditions and the distance from the object, as had been expected. This non-maximum suppression leads to a binary edge map with thin edges. The results of the individual edge detection steps shown in figure 4.1 include a comparison between a simply thresholded image and the image after applying non-maximum suppression, which clearly demonstrates the importance of this operation.
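A sketch of such a sign-based classification (illustrative; the decision logic in the actual implementation may differ):

    // Classify the gradient (dx, dy) into one of the eight 45-degree directions
    // without computing arctan: |dy| <= |dx| selects the octants closest to the
    // horizontal axis, the signs of dx and dy pick the quadrant.
    static TInt Direction(TInt aDx, TInt aDy)
        {
        if (Abs(aDy) <= Abs(aDx))
            return (aDx >= 0) ? ((aDy >= 0) ? 1 : 8) : ((aDy >= 0) ? 4 : 5);
        else
            return (aDy >= 0) ? ((aDx >= 0) ? 2 : 3) : ((aDx >= 0) ? 7 : 6);
        }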

4.4.4 Straight Line Extraction

Figure 4.1: Individual steps of edge detection: (a) Original image (top left), (b) Sobel filtered (top right), (c) Sobel and threshold (bottom left), (d) Sobel, non-maximum suppression and threshold (bottom right)

In order to extract straight line segments from the image, a greedy grouping procedure is applied to the edge pixels. The method (here explained for horizontal segments) scans every row and proceeds as follows: if the current pixel is an edge pixel (i.e. not zero), check its neighbouring pixels within 0°, 45° and -45°. If one of these is also an edge pixel, set the current pixel as the starting point of a stroke. Then proceed to check this edge pixel for its horizontal neighbours, and continue until an edge pixel is met that does not have any neighbours to the right. This last pixel is then set as the end point of the stroke, and the algorithm continues to process the starting line. If the starting pixel does not have any edge pixel neighbours, discard it in the target image. To connect the shorter collinear strokes to longer segments, a small (5 pixels) window is moved over the end points in order to detect start points that are located within 45° (positive or negative) of the end point. Given the start and end coordinates, we can also infer all the information needed for the factor graph, namely the length of the segment, its position and its orientation (slope). This information is then stored in an array of TPoint objects created to represent line segments, with consecutive elements being regarded as neighbours, i.e. candidate pairs for the binary cues used in the factor graph.
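A condensed sketch of the gap-closing step, assuming strokes are stored pairwise (start, end) in an RArray<TPoint>; the 5-pixel window follows the text, the orientation check is omitted for brevity, and all names are illustrative:

    // Merge collinear strokes whose end/start points fall within a small window.
    void MergeStrokes(RArray<TPoint>& aSegments)
        {
        const TInt KGap = 5;   // window size from the grouping procedure
        for (TInt i = 0; i + 3 < aSegments.Count(); i += 2)
            {
            const TPoint end   = aSegments[i + 1];   // end of current segment
            const TPoint start = aSegments[i + 2];   // start of next segment
            if (Abs(start.iX - end.iX) <= KGap && Abs(start.iY - end.iY) <= KGap)
                {
                aSegments[i + 1] = aSegments[i + 3]; // extend current segment
                aSegments.Remove(i + 3);             // drop the merged stroke
                aSegments.Remove(i + 2);
                i -= 2;                              // re-check the extended segment
                }
            }
        }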


4.4.5 Factor Graph and Belief Propagation

After having extracted straight line segments from the image, the factor graph is built based on those image features. This section will explain one way to implement a factor graph and represent the cues listed in the previous chapter. It has to be noted, however, that the factor graph belief propagation has not been fully implemented and that the steps described in this section only act as a pointer towards a finalised implementation. The implementation is largely based on the libDAI library and Intel's Probabilistic Network Library (PNL), two open source libraries for inference on graphical models in C++, which provide a good starting point for porting the algorithms to the Symbian platform [Moo, Int].

To build up the factor graph, some helper classes are needed that provide a data structure for the different pieces of information stored within the graph. A class for the individual factors called CFactor holds a set of variables (the arguments of the function) and a reference to a probability vector as data members, along with methods to manipulate this data. The probability vector describes the value of a factor depending on all possible variable assignments. With respect to an efficient implementation, both this vector and the set of variables are constructed using Symbian's RArray template class, a wrapper for accessing arrays of structures and objects5. Corresponding to the definition of factor graphs in the previous chapter, the factor graph class CFactorGraph has an array of variables (that are either one or zero) and an array of CFactor objects as data members. Edges in the graph are represented by an array of factor neighbours for each variable, and an array of variable neighbours for each factor node6, in order to distinguish between the different types of edges. An edge is then added by including the variable and factor involved in the respective set in the graph and adding entries to both of the neighbour lists. Accordingly, in order to remove an edge, the entries are deleted from the neighbour arrays (which is also done if the respective variable or factor nodes are removed from the graph). In this context of image processing, each edge between a factor and a variable corresponds to a cue used to determine the state of a segment (the variable that the factor is connected to).

5 In this section the terms list and set are only used for legibility reasons and do not suggest the use of the respective data structures.
6 As we are dealing with a bipartite graph, it is ensured that there exist only edges between two different types of nodes.
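A condensed sketch of these helper classes (the member layout is illustrative; the actual classes carry additional bookkeeping and manipulation methods):

    // Factor node: the variables it connects and its value table.
    class CFactor : public CBase
        {
    public:
        RArray<TInt> iVariables;  // indices of connected variable nodes
        RArray<TInt> iValues;     // one (fixed-point) value per joint assignment
        };

    // Bipartite factor graph with explicit neighbour lists for both node types.
    class CFactorGraph : public CBase
        {
    public:
        RArray<TInt>           iVariables;   // node states: 0 = ground, 1 = figure
        RPointerArray<CFactor> iFactors;
        // one neighbour list per variable (its factors) and per factor (its
        // variables); adding an edge appends an entry to both lists, and
        // removing an edge deletes both entries
        };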

In order to compute the MAP, belief propagation is then performed on the factor graph. A BP class is created for this task, which holds objects for the messages passed between the graph nodes as member variables, along with methods for creating and updating messages. When creating a BP object, it is initialised with the factor graph that the operation is carried out on. The beliefs for the segments are then computed based on the algorithms explained in the second chapter. By using the simplified version of the factor graph belief propagation algorithm, no message updates are necessary in order to determine the beliefs for the segments in the image. The segments that are regarded as belonging to a suitable quadrilateral shape are then saved in an array of four TPoint objects that mark the start and end points of the segments in clockwise direction (top, right, bottom, left). If several quadrilateral shapes are detected in the image, the one with the highest belief is analysed first with the warping and template matching procedures described in the next section; if the output is negative, the steps are repeated for the other detected shapes.

4.5 Phase Two: Object Recognition

4.5.1 Final Content Analysis

Once we have obtained the coordinates of the rectangle's corner points, we can proceed with the analysis of the rectangle's content. First, the captured image is drawn to a bitmap in greyscale mode, again using Symbian's bitmap API. This is followed by the planar homography that projects the found quadrilateral shape with perspective bias onto a rectangle, using a transformation matrix that is constructed on the basis of the four corner points of the distorted and target rectangle respectively. The implementation of this projective transformation is based on the code described in [Blo], which has been ported from C to Symbian C++. The transformation matrix that is determined in the first stage is then used to compute, for each pixel in the target rectangle, the corresponding pixel in the quadrilateral. When implementing this method, particular attention has to be paid to optimising the multiple divisions that occur when computing the transformation matrix and the final output, by using fixed point arithmetic with integers rather than Symbian's TReal class for float values and standard division.
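A sketch of the fixed-point division this amounts to, using a 16.16 representation (the scaling factor is an implementation choice, not taken from the actual source):

    // 16.16 fixed-point division: widen to 64 bits so the shifted numerator
    // cannot overflow, then divide with plain integer arithmetic.
    const TInt KFixedShift = 16;

    inline TInt FixedDiv(TInt aNum, TInt aDen)
        {
        return static_cast<TInt>((static_cast<TInt64>(aNum) << KFixedShift) / aDen);
        }

    // Example use in the inverse mapping of equation (3.2), with the
    // a-coefficients held in fixed-point form (kOne = 1 << KFixedShift):
    // TInt xSrc = FixedDiv(a11*x + a21*y + a31, a13*x + a23*y + kOne);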

For the next step (determining the ratio between green background pixels and white icon pixels), a very simple histogram analysis procedure was implemented: the clipped image's greyscale values are compared to a lower and an upper boundary value chosen through testing (the average greyscale value of the background green is approximately 100) near the expected background colour, in order to separate the green background from the white icons. The ratio of pixels that are within the boundaries (i.e. green pixels) to the number of pixels that are above the upper boundary (white pixels) has to be close to 1.1, which was also determined through testing.

If this quick check produces a positive result, i.e. the found rectangle is a candidate for an emergency exit sign, the template matching is performed. For this purpose, eight exit sign templates are created as binary images, with the same dimensions as the target rectangle used in the projection step. The pixels from the template and the thresholded image are then compared pairwise and their differences are summed. If the sum (i.e. the difference between the two images) is minimal, it is assumed that the sign matches the respective template. The arrow direction is then saved as one of eight directions and output by the system. Figure 4.2 shows that even templates with the same layout (text, icon and arrow from left to right) clearly differ in the relatively large white section that defines the tip of the arrow, which is how they can be distinguished.

Figure 4.2: Two examples of binary sign templates

The templates are created as bitmaps and then referenced in the project's MMP file, which includes project-specific instructions for the compiler, such as libraries that need to be included. The bitmaps are then integrated into a Symbian multi-bitmap file (.mbm) during compilation and can be loaded using their automatically generated file names or their enumeration indices in the source code. Listing A.2 demonstrates how the candidate rectangle is matched with each of the eight templates in the .mbm file.


4.6 Result Output

The results of the object recognition procedures carried out in the previous stages need to be output in a way that is suitable for visually impaired users. As previously mentioned, the signal tone in the first stage (finding a rectangular shape) is a simple bleep sound, which is produced using the Symbian library for system sounds such as warning and error messages. Since the sound is repeated for each frame in which a rectangle is recognised, this is the least obtrusive way to notify the user of the presence of a potential sign. Once the image is captured and processed in the second stage, the result is output as text, using Symbian's CAknInformationNote pop-ups that display text for several seconds. The text is then picked up by the screen reader that is installed on the device and read out through the screen reader software's text-to-speech synthesis. The output informs the user whether an exit sign could be detected in the image or not, and gives the arrow direction if an arrow is present:

- No exit sign found.
- Emergency exit sign found. No arrow detected.
- Emergency exit sign found. Arrow direction: Top right.

These messages can be displayed again by pressing any key on the keypad, which adds to the usability of the application.
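A sketch of this output path, assuming the standard Avkon note API:

    // Display a text result; the screen reader picks up the note text and
    // reads it out through its text-to-speech engine.
    CAknInformationNote* note = new (ELeave) CAknInformationNote;
    note->ExecuteLD(_L("Emergency exit sign found. Arrow direction: Top right."));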

4.7 Optimisation for Symbian S60 devices

For applications that update the screen at short intervals, as is done here to display the processed image, it is recommended to bypass the window server that scales and clips the image before it is drawn on the screen. Symbian provides direct screen access via its CDirectScreenAccess interface. While this would not be important in a final application that does not display the camera image, it can speed up drawing and give a more accurate impression of the system performance in the development stage, where the screen output is necessary for debugging purposes.

Since several copies of the processed image are being held in memory as simple arrays, it is important to increase the application's heap size before allocating memory for the image data. The SDK offers a way of setting a minimum and maximum size, for which a check is performed before starting the application. In order to allow for sufficient memory, the maximum heap size was changed from the default 1MB to 4MB. This solved the problem of application crashes caused by accessing pointers that were initialised to NULL due to a lack of memory. As suggested in [ICS08], the workload during runtime can be reduced by using static arrays rather than dynamically allocated memory. Due to the fact that the images used in the application always have the same size (320x240 pixels and 640x480 pixels respectively for the images captured from the camera), sufficiently large arrays can already be constructed at compile time.
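In the project's MMP file this corresponds to a single statement (values in bytes; the minimum shown here is an assumption, only the 4MB maximum is taken from the text):

    // EPOCHEAPSIZE <minimum> <maximum>: raise the maximum heap to 4MB
    EPOCHEAPSIZE 0x1000 0x400000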

4.8 Chapter Summary

This chapter discussed the application development process and the implementation of complex processing tasks on the Symbian OS platform. The development was carried out using the Carbide.c++ IDE provided by Symbian. The application makes use of the platform's camera API to capture continuous and still images, which are passed on to the processing module. The image processing is then performed in two stages, with greyscaling and edge extraction being applied to the image in a first step, followed by the object detection core. After completion of the implementation stage, the program was tested and evaluated, which will now be described in the following two chapters.

Chapter 5 Testing
5.1 Overview

In this chapter, we will outline the testing and evaluation procedures that were performed throughout the development process and when examining the final version of the application. This also includes a discussion of the expected and desired test results, which define the criteria for success of the project. The chapter is concluded by an overview of the testing results with respect to speed and recognition rates of the application in different testing environments.

5.2 Description of the Testing Procedures

5.2.1 Ad-hoc Testing

Testing took place throughout the different stages of the application development in order to ensure adequate performance and recognition rates when implementing new functionalities. The testing includes checks for both the performance of the image processing module and the robustness of the software. Regarding the correctness of the software, we carried out informal tests for each completed unit and module integration step. These deal with code coverage criteria in particular, in order to ensure that all statements in the code have been executed and tested for validity, and do not contain any bugs or errors. Ad-hoc tests are performed conveniently on the Symbian SDK emulator, while the more critical tests are carried out on the actual handset.

It has to be noted that accessing a camera (such as the development system's built-in laptop camera) is not possible when using the emulator. This makes it necessary to test on static images captured with a phone camera, which means that only the processing results, but not the performance of the system, can be determined. Due to the movement of the camera, the results are also expected to be less accurate on the live system. We tried to compensate for these circumstances by manually adding noise and motion blur to the static testing images that were used with the emulator. By carrying out testing on the handset, it is also ensured that the program can be installed and runs on the intended device. On the completion of every module, the prototype is tested for compliance with the requirements defined in section 3.2. If necessary, this triggers a revision of the code and the modification or addition of functionalities.

5.2.2 Final Testing

In terms of object recognition performance, we aim at a relatively high rate of true positives and a low number of false positives, in combination with a short processing time. Since imperfect results are more acceptable to users than long latency [BB08], we focus mainly on the efficiency of the application, i.e. fast execution of all program functions. If the desired results for the tests are not achieved, the code is reviewed in order to improve the performance. Early tests are carried out on a large set of positive and negative images in order to determine the performance of the chosen object recognition methods, with the final testing being performed live in buildings on a smaller test set. It would also be desirable to have the software tested for usability (ease of use) and evaluated by users who are unfamiliar with the system; however, due to the restricted functionality of the application, this is not of highest priority. Live tests with users were not carried out because the system was not in a finalised state at this stage. However, due to the minimal interaction necessary for running the system, user tests are only expected to provide feedback regarding the performance of the system rather than the actual usability. This is why user tests were not considered absolutely necessary when evaluating the system in its current state.


5.2.3 Desired Results

Due to the complexity of the Symbian platform and the limited number of applications that can be directly compared to our project, figures regarding expected results can only be estimated based on related work. A quick experiment with a stopwatch shows that a comfortable time to pan a phone from left to right, i.e. approximately 180 degrees, is about 10 seconds. This figure can act as a guideline for determining how many images have to be captured and processed within 10 seconds in order to cover the whole area of 180 degrees around the user. With an average angle of view of roughly 40 degrees for standard phone cameras (defined by the focal length of the lens), we can infer that the system needs to be able to capture at least 5 images in those 10 seconds (i.e. one image every 2 seconds) to provide comfortable use of the system. The projects mentioned in section 2.5 in fact confirm the experiment and give a rough estimate for suitable recognition rates:

- The processing time for a single frame is less than 2 seconds. Automated processing in real-time (as demonstrated in [ICS08]) is highly desirable, but depends strongly on the implementation and will therefore only be regarded as an additional feature for this project.
- The recognition rate (true positives) lies at approximately 75% of all test images.
- The number of false positives has to be treated with particular care, as it is unacceptable to send the user in the wrong direction. Therefore, no more than 1% of the test images should be incorrectly classified as a sign.

These are only basic requirements that give an overview of the most significant aspects of the final system. However, the focus of the project lies primarily on the structured realisation of the proposed system design. It is deemed obvious that an ideal, i.e. highly efficient and correct, system can only be achieved through profound knowledge of the platform, along with multiple code revisions and sufficient time for exploring different approaches to one problem.


5.3 System Performance Evaluation

While the overall execution speed of the recognition application is a major aspect in evaluating the system, the first phase of the recognition, which detects rectangular (quadrilateral, that is) shapes, is considered particularly critical, as it is performed in real-time while the user is panning the phone. The first step, Sobel filtering and non-maximum suppression, achieved roughly 3 frames per second when tested on the Nokia N95. This can be considered real-time performance and lies within the expected time frame.

The straight line extraction did not yield optimal results in live tests on the device due to noise, motion blur and varying lighting conditions; therefore the execution speed was not measured. In tests on the SDK's emulator, the straight line detection suffered from wide gaps between shorter segments that could not be closed. The edge detection algorithm clearly needs optimising in order to deal with those issues, which can be achieved through extensive tests to adjust the chosen thresholds.

As the factor graph belief propagation algorithm was not implemented far enough to allow statements about its performance, we can only give rough estimates based on the sources that served as the foundations for this method. It is expected to perform close to real-time, though slower than stated in [ICS08] due to the higher number of arity-4 cues used in the factor graph. The ensuing warping procedure, which involves a large number of divisions, will slow down the system performance if not optimised for fixed point calculations, which is crucial for this operation. This procedure could be omitted by restricting the number of recognisable signs to those that lie within a certain angle from the camera and thus do not suffer from significant distortion. However, due to the minor differences between the templates (for example, arrows pointing to the top and bottom only differ in the tip of the arrow), as well as blurring and noise in the image, it proved difficult to choose a suitable threshold to decide between the outcomes without causing too many false positives or false negatives when matching images that had not been warped.

Finally, the template matching for eight different templates is carried out with sufficiently high performance (under one second in tests on the phone), as was expected for a small number of binary templates. In this stage, the worst case is that all eight templates need to be checked, while in the best case the first template already matches the input image, which reduces the processing time.

Regarding the overall outcome, it can be stated that the performance of the system was not as efficient as expected, which is believed to be caused by the non-optimised and rather straightforward implementation of the proposed methods. However, these optimisations are considered only a matter of following standard procedures, which does not affect the general feasibility of implementing the chosen method on the Symbian platform.

5.4 Chapter Summary

In this chapter we outlined the testing procedures that were carried out throughout the development process to ensure the quality and performance of the produced application. Both the efficiency and the effectiveness of the object recognition module were evaluated, along with standard software testing procedures that were carried out in order to check the correctness of the application. Based on the system performance, it was concluded that optimising the application with respect to Symbian guidelines is key to achieving an efficient implementation.

Chapter 6 System Evaluation


6.1 Overview

After specifying the application design and outlining the implementation process in the previous chapters, we will now review and discuss the overall project development. Firstly, the chosen approach to the given task of recognising exit signs will be critically analysed, along with the system design. This is followed by an evaluation of the project in relation to existing work, where the advantages and disadvantages of the chosen method will be discussed. The chapter is concluded by a critical review of the project schedule and possible improvements that can be made to the system that was developed.

6.2 Analysis of the Research Methodology

It may be argued that the chosen methods for the feature extraction and object detection stages were not the most suitable for this task regarding efficiency and ease of implementation. However, due to the vast number of different and varied methodologies and algorithms in this field, it can only be stated that the preceding research was carried out carefully, which led to the decision to choose an approach based on the work most similar to the given task that indicated successful results. Another positive aspect of this method is the small amount of training data that is needed for determining the cues that the factor graph is built on. This is a great advantage over methods such as AdaBoost, which need large amounts of labelled training images, sometimes up to several thousand, in order to produce good results. While the decision to base the application core on statistical inference using a method from machine learning is an interesting and still rarely used approach to image processing on mobile platforms, its complexity proved to be a drawback. Especially with respect to the use on a restricted platform such as Symbian OS, finding a suitable way of efficiently implementing the construction of a factor graph and inference using the belief propagation algorithm was not feasible given the time constraints. With respect to the specified task, the project can therefore be regarded as unsuccessful. However, in order to compensate for this issue, alternative approaches to the problem were explored that would simplify the implementation and could still achieve acceptable results.

With respect to Symbian OS smartphones as the chosen platform for this application, it can be said that there are hardly any alternatives for developing a complex application such as the emergency exit sign recognition system. Due to the large range of available handsets, the availability of screen reader applications for visually impaired users and the extensive resources for developers, the platform can be regarded as superior in its suitability for this task.

One of the advantages of the developed system design is its modularity. By restricting the first step of the recognition phase to rectangular objects, the program can be extended to recognise other standardised signs by using different templates in the second stage. As the final decision on the content of the sign is delayed until several checks give a clear indication of the result, the recognition rate, and especially the number of false positives, can be optimised. The system does not rely on manually highlighted regions of interest in the image or on markers that help to locate the signs, and is therefore deemed more flexible and user-friendly than previous approaches to object detection on mobile devices using touchscreens.

The recognition software is designed to be accessible by a screen reader, which is very likely to be already installed on a blind user's smartphone. As the program only outputs information as text over Symbian's built-in notification interface, it is guaranteed to work with any type of screen reader or display magnifying software. This is a great advantage over systems that present information through graphically oriented interfaces. In addition, by steering away from including speech output in the system, it can also be easily extended to provide more information to the user without having to produce new sound files. This is also an advantage when considering the development of a multilingual version of the software. The fact that the voice output of the recognition system is the same as the general text-to-speech voice used on the phone can also add to the user's acceptance of the application. Finally, both the installation files (.sis) and the use of system memory during program execution are kept relatively lightweight when favouring written text output over speech output included in the application.
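To illustrate this text-only output path, the following minimal sketch shows how a detection result could be announced through a simple system notification; the function name, direction codes and message strings are illustrative assumptions, with User::InfoPrint() standing in for whichever notification call the final implementation uses:

// Sketch only: announce the detection result as plain text so that any
// installed screen reader can read it out. Direction codes and strings
// are assumptions, not taken from the actual implementation.
void NotifyResult( TInt aDirection )
    {
    switch ( aDirection )
        {
        case 0:
            User::InfoPrint( _L( "Exit sign: straight ahead" ) );
            break;
        case 1:
            User::InfoPrint( _L( "Exit sign: arrow to the left" ) );
            break;
        case 2:
            User::InfoPrint( _L( "Exit sign: arrow to the right" ) );
            break;
        default:
            User::InfoPrint( _L( "No exit sign detected" ) );
            break;
        }
    }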

6.3 Review of the Project Plan

The project schedule was designed before the start of the implementation stage of the project and was structured into three main stages. It allowed for a relatively long phase (one month) of getting familiar with the chosen platform and its programming language, Symbian C++. This stage was followed by the implementation of the program core, which was supposed to take another month. The final stage (one month) would be application testing, evaluation and writing up of the insights gained during the course of the project. While this plan seemed adequate given the complex platform and the difficult task of optimising the system for real-time performance, it did not leave much room to deal with problems caused by implementation errors. This was worsened by the different error handling procedures on the Symbian emulator and the actual device, along with flaws of the IDE and SDK tools such as non-transparent caching mechanisms, undocumented emulator crashes and limited debugging facilities. While the first stage was completed in a shorter period of time than scheduled, the main implementation stage suffered from the aforementioned problems. This led to delays which made it necessary to cut the task down to a simplified version of the system, as well as to reduce the time scheduled for testing and evaluation. The conclusion that can be drawn for future projects is to allow enough time for troubleshooting when dealing with unfamiliar platforms. When working under strict time constraints, stepping back to the research phase to develop an alternative route is too time-consuming to keep up with the project plan. Despite the incomplete implementation, the research work carried out for this project and the application design based on this research are still regarded as an adequate and convincing approach to solving a complex problem on a platform with limited resources.


6.4 Improvements

As mentioned previously, the original task of implementing a recognition system based on a statistical method like factor graph belief propagation had to be reduced to a simplified solution to the recognition problem. Thus, the most obvious improvement would be to include an implementation of the factor graph belief propagation method outlined in 3.4.3 into the finalised software. Based on evaluation results from [ICS08], this is expected to improve both the processing speed (as the template matching phase is abandoned) and the recognition rate by relying on multiple cues, thereby removing a number of false positives from the findings.

In order to improve the recognition rate for exit signs that differ slightly from the ones shown in 1.1, the templates for the final stage could be split up into their three components. For example, a variation of the signs shows the words FIRE EXIT in capital letters, which would not exactly match the template. In order to deal with this seemingly minimal difference, the matching algorithm could simply look for the running person icon in the centre of the sign (only two templates) and then perform matching for the three (icon facing the left) or five (icon facing the right) arrow templates. This method would simply ignore the presence of text on the sign, but the uniqueness of the icon and arrows is expected to guarantee correct results, and it would speed up the matching performance by reducing the template size and number (a sketch of this two-stage lookup is given at the end of this section).

With respect to the implementation on the Symbian OS platform, it can be stated that the produced code still needs to be optimised in some areas. In particular, the memory management can be improved by paying more attention to the careful use of system memory, as well as by using simplified or approximated algorithms. As the image processing phase does not return to capture the next frame until the processing is completely finished, it would also have been useful to implement the CMSPImgProcessor class as an Active Object (see section 3.3). This would have allowed the image capturing and the image processing to run asynchronously, so that the next image is already fetched and prepared for processing while the previous frame is still being analysed (see the outline at the end of this section). It is obvious, however, that the purpose of this study was mainly to carry out research into the topic and demonstrate a possible implementation, rather than to produce a highly optimised piece of software for a rather unfamiliar platform. This also explains why this report does not discuss the interaction of the recognition software with other phone functions such as incoming calls, text messages or other applications running in the background. Of course, those features and events have to be considered when implementing applications for mobile platforms outside an academic environment.
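To make the two suggested improvements more concrete, two brief sketches are given below. The first outlines the two-stage template lookup described above; the helper MatchRegion() and the template index ranges are hypothetical, since this variant was not implemented:

// Hypothetical two-stage matching: first identify the running-person
// icon (left- or right-facing), then match only the arrow templates
// that can occur with that orientation. MatchRegion() is assumed to
// return the best-matching template index within a range, or -1.
TInt CMSPImgProcessor::MatchSignComponents( CFbsBitmap* aSign, TInt aTh )
    {
    // stage 1: two icon templates (0 = facing left, 1 = facing right)
    TInt icon = MatchRegion( aSign, 0, 1, aTh );
    if ( icon < 0 )
        {
        return -1; // no running-person icon found
        }
    // stage 2: the candidate arrows depend on the icon orientation
    // (three candidates for a left-facing icon, five for a right-facing one)
    return ( icon == 0 )
        ? MatchRegion( aSign, 2, 4, aTh )   // assumed arrow templates 2..4
        : MatchRegion( aSign, 5, 9, aTh );  // assumed arrow templates 5..9
    }

The second sketch outlines the standard Symbian Active Object pattern mentioned above; the capture and processing calls are placeholders rather than code from the project:

// Minimal outline of CMSPImgProcessor as an Active Object. RunL() is
// called by the active scheduler when an asynchronous capture request
// completes, so the next frame can be requested before the current one
// is analysed.
class CMSPImgProcessor : public CActive
    {
public:
    CMSPImgProcessor() : CActive( EPriorityStandard )
        {
        CActiveScheduler::Add( this );
        }
    void StartCapture()
        {
        CaptureFrame( iStatus );  // placeholder: asynchronous capture
        SetActive();
        }
private:
    void RunL()
        {
        StartCapture();   // request the next frame immediately
        ProcessFrameL();  // then analyse the frame just captured
        }
    void DoCancel()
        {
        CancelCapture();  // placeholder: abort the pending request
        }
    };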

6.5 Chapter Summary

This chapter discussed the success of the study by evaluating the chosen solution to the problem of image processing on mobile platforms. This included a review of the chosen approach, which led to the conclusion that the research methods used for the application were appropriate for the given task. This was followed by a critical analysis of the project plan and suggestions for improvements that could have been made to the system and overall development process had time constraints not applied.

Chapter 7 Conclusion and Future Work


7.1 Project Summary

This report has described the process of developing an image recognition system on a mobile platform which assists visually impaired users in finding emergency exit signs. In the introduction we gave a description of the motivation behind the study, which is to make use of mobile phone technology as an assistive device for visually impaired persons, and to carry out research into the feasibility of image processing on mobile platforms. The system's main objectives were given as a sample flow of events when the application is used by a blind person to detect an emergency exit sign.

The first task was to decide on which smartphone platform the software was to be developed. Different platforms were discussed with respect to their processing power and ease of developing applications, and it was decided that the software was to be developed for Symbian OS smartphone models using its native programming language, Symbian C++. The Symbian S60 platform in particular was deemed most appropriate due to its popularity and wide use on powerful devices such as the Nokia N95.

We then gave an extensive review of related work that discussed the different approaches to the problem of image processing on devices with restricted computing power. The different methods can be grouped into server-client structures on the one hand, where the captured image is sent to a server for processing, and on-device processing on the other hand, out of which the studies using factor graph belief propagation seemed the most successful and efficient. Due to long file transfer times and the possible lack of a network connection, the server-client approach was deemed unsuitable for the given task.

In the ensuing chapter a high-level description of the system architecture was given in order to provide the reader with an overview of the most important points in the development process. The application itself has been organised into modules, each with a different function, that interact over clearly defined interfaces. The software's structure and behaviour were described using both text and appropriate UML diagrams. In this chapter, we also proposed a simplified version of the rectangle detection method, as well as a description of the more sophisticated belief propagation.

The software implementation was completed using the Carbide.c++ IDE, provided by Symbian. It uses the camera's API to capture both still and continuous images that are then processed. In the first step, the image was converted to greyscale and an edge extraction filter was applied, which produced an edge map. The actual object detection was then carried out using factor graph belief propagation, a message passing algorithm on a graphical model that computes the belief of an image segment as the likelihood of it being part of the figure (as opposed to the background). The final decision whether an emergency exit sign was present in the image was then based on the (greyscale) histogram of the thresholded sign and a template matching procedure.

Testing was carried out throughout the whole development process, as well as after completing the implementation phase. It was essential to test both the quality of the application (identifying the exit signs in various situations and from various angles) and the performance of the processing module: given the limited processing power of mobile phones, can the image processing and identification be run quickly enough? The necessary testing procedures were explained in the respective chapter, along with an outline of the available test results. Finally, a review of the application design and development process, along with suggestions for possible improvements, was given in the previous chapter. The evaluation of the project was important to demonstrate the understanding of the topic and the ability to critically analyse the work carried out for this study.

This dissertation discussed and combined methods taken from a number of different research disciplines, such as signal processing, statistics and software development for mobile platforms. This makes it a valuable piece of work that, while providing an extensive review of the different areas and their application for image processing tasks, may also function as a starting point for further exploration of the aforementioned topics. While not all of the main objectives were achieved, the significant amount of research carried out for this study, as well as the clearly laid out methodologies, the structured system design and the exploration of different approaches to the problem demonstrate the general feasibility of the task based on the chosen solution. As the use of factor graph belief propagation for image processing tasks on mobile platforms is yet to be comprehensively investigated, it is strongly encouraged to carry out further research based on the conclusions drawn from this work.
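As a small illustration of the first processing step recapped above, greyscale conversion is commonly implemented with the standard luma weighting; the following sketch uses the textbook coefficients, and the values used in the project's own implementation are not restated here:

// Standard ITU-R BT.601 luma weighting for greyscale conversion.
// The coefficients are textbook values and may differ from the
// (possibly integer-approximated) ones used in the actual code.
static TUint8 ToGrey( TUint8 aRed, TUint8 aGreen, TUint8 aBlue )
    {
    return static_cast<TUint8>( 0.299 * aRed + 0.587 * aGreen + 0.114 * aBlue );
    }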

7.2 Future Work

In order to make the system's output even more useful and accurate, the text on the emergency exit sign could be analysed in addition to the other features that have been discussed in this study. This could be achieved using the OCR (Optical Character Recognition) API provided by the Symbian operating system. After detecting the section of the sign that contains the text, the methods offered by the API take the bitmap and information about the text region (bounding box, background colour) and return the recognised text. The text can then be output over the screen reader's text-to-speech and provide the user with more information about the sign content. While the API has not been tested for this project, it is expected to deliver relatively good results, considering it was designed with the aim of recognising very small text such as addresses found on business cards.

Eventually, it would be an appropriate next step to research the feasibility of utilising factor graph belief propagation for all stages of the recognition phase, i.e. for grouping pixels in the straight line extraction phase, detecting rectangular structures and analysing the icons and arrows on the sign. This methodology promises a very efficient implementation of detection procedures on computationally weak mobile platforms. The success of this method is almost exclusively based on the choice of suitable cues, which have to be carefully considered given the complex structure of the different icons found on a sign. With respect to future work, it would be particularly interesting to implement a general framework for factor graph belief propagation in Symbian C++ in order to provide a basis for further exploration of real-time image processing on this platform.


By combining efficient object recognition through factor graph BP and OCR, it would be possible to develop the system even further to recognise various types of signs that combine icons and text using on-device processing. This is a highly interesting application of the methodologies described in this project that could even serve as a replacement for the server-client architectures currently in use to carry out computationally heavy image processing tasks. As the popularity of mobile platforms, and camera smartphones in particular, is expected to grow even further in the future, it is desirable to continue exploring their use not only for commercial software, but also for assisting people with physical disabilities.

Bibliography

[BB08] Erich Bruns and Oliver Bimber. Adaptive training of video sets for image recognition on mobile phones. Journal of Personal and Ubiquitous Computing, 13:165–178, 2008.

[Blo] Dan Bloomberg. Leptonica. http://www.leptonica.com/affine.html. Accessed: 12/07/2009.

[BS500] BS 5499-4:2000. Safety signs, including fire safety signs. The British Standards Institution, 2000.

[DLQ+06] Tudor Dumitras, Matthew Lee, Pablo Quinones, Asim Smailagic, Dan Siewiorek, and Priya Narasimhan. Eye of the Beholder: Phone-Based Text-Recognition for the Visually-Impaired. In IEEE International Symposium on Wearable Computers, pages 145–146, 2006.

[For] Forum Nokia. http://www.forum.nokia.com. Accessed: 10/04/2009.

[FS99] Yoav Freund and Robert E. Schapire. A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence, 14:771–780, 1999.

[FZB+05] Paul Föckler, Thomas Zeidler, Benjamin Brombach, Erich Bruns, and Oliver Bimber. PhoneGuide: Museum Guidance Supported by On-Device Object Recognition on Mobile Phones. In International Conference on Mobile and Ubiquitous Computing, pages 3–10, 2005.

[Gar09] Gartner Newsroom. http://www.gartner.com/it/page.jsp?id=910112, 2009. Accessed: 11/08/2009.

[GdGH+06] N. J. C. Groeneweg, B. de Groot, A. H. R. Halma, B. R. Quiroga, M. Tromp, and F. C. A. Groen. A Fast Offline Building Recognition Application on a Mobile Telephone. In Advanced Concepts for Intelligent Vision Systems, volume 4179 of Lecture Notes in Computer Science, pages 1122–1132. Springer Berlin / Heidelberg, 2006.

[Gel02] Andrew Gelman. Posterior Distribution. Encyclopedia of Environmetrics, 3:1627–1628, 2002.

[GW02] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing. Prentice Hall, 2nd edition, 2002.

[ICS08] Volodymyr Ivanchenko, James Coughlan, and Huiying Shen. Detecting and locating crosswalks using a camera phone. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2008.

[Int] Intel. Probabilistic Network Library. http://sourceforge.net/projects/openpnl. Accessed: 05/06/2009.

[KFL01] Frank Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger. Factor Graphs and the Sum-Product Algorithm. IEEE Transactions on Information Theory, 47:498–519, 2001.

[Kob09] Nicole Kobie. Nokia's Point & Find uses camera phone for search. http://www.itpro.co.uk/610402/nokias-point-find-uses-camera-phone-for-search, 2009. Accessed: 03/04/2009.

[Koo] Kooaba. http://www.kooaba.com/mobile-marketing/cases. Accessed: 12/02/2009.

[KT07] Surendra M. Kumar and Timothy Jwoyen Tsai. CAT – Camera Phone Color Appearance Tool. Stanford University, 2007.

[Mob] All About Mobile Life Blog. http://mobile.kaywa.com/qr-code-data-matrix. Accessed: 12/02/2009.

[Moo] Joris Mooij. libDAI – A free/open source C++ library for Discrete Approximate Inference methods. http://www.kyb.mpg.de/bs/people/jorism/libDAI. Accessed: 05/06/2009.

[Nok] Nokia Mobile Codes. http://mobilecodes.nokia.com/scan.htm. Accessed: 12/02/2009.

[PTAE09] Sobhan Naderi Parizi, Alireza Tavakoli Targhi, Omid Aghazadeh, and Jan-Olof Eklundh. Reading Street Signs Using a Generic Structured Object Detection and Signature Recognition Approach. In International Conference on Vision Application, 2009.

[RNI] RNIB. Statistics – numbers of people with sight problems by age group in the UK. http://www.rnib.org.uk/xpedio/groups/public/documents/PublicWebsite/public_researchstats.hcsp. Accessed: 11/05/2009.

[RR06] Christof Roduner and Michael Rohs. Practical Issues in Physical Sign Recognition with Mobile Devices. ETH Zürich, 2006.

[SC07] Huiying Shen and James Coughlan. Grouping Using Factor Graphs: An Approach for Finding Text with a Camera Phone. In Graph-Based Representations in Pattern Recognition, 2007.

[Sym09] Symbian Developer Network. http://developer.symbian.com, 2009. Accessed: 10/04/2009.

[TAT09] TAT – The Astonishing Tribe. http://www.tat.se/site/showroom/latest_design.html, 2009. Accessed: 11/08/2009.

[YFW03] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Understanding Belief Propagation and its Generalizations, 2003.

[Yua05] Michael Juntao Yuan. What Is a Smartphone. http://www.oreillynet.com/pub/a/wireless/2005/08/23/whatissmartphone.html, 2005. Accessed: 08/03/2009.

Appendix A Listings
Sobel Operator
TInt CMSPImgProcessor::DetectEdges()
    {
    TInt w = iSize.iWidth;
    TInt h = iSize.iHeight;
    TInt imgSize = w * h;
    [...]
    for ( i = 0; i < imgSize; i++ )
        {
        // if we're at the first column - first pixel of a row
        if ( i == ( y + 1 ) * w )
            {
            y++;
            x = 0;
            }
        else
            {
            x++;
            }
        // initialise arrays with 0
        grad[i] = 0;
        xGrad[i] = 0;
        yGrad[i] = 0;
        // if we're not in the first/last column or row (image boundaries)
        if ( x > 0 && x < w - 1 && y > 0 && y < h - 1 )
            {
            // apply Sobel filter
            xGrad[i] = gImg[i+w+1] + gImg[i-w+1] + ( 2 * gImg[i+1] )
                       - gImg[i-w-1] - gImg[i+w-1] - ( gImg[i-1] * 2 );
            yGrad[i] = gImg[i-w-1] + gImg[i-w+1] + ( gImg[i-w] * 2 )
                       - gImg[i+w-1] - gImg[i+w+1] - ( 2 * gImg[i+w] );
            grad[i] = Abs( xGrad[i] ) + Abs( yGrad[i] );
            max = Max( grad[i], max );
            } // end if
        } // end for

    // normalise values to range 0..255
    // max is initialised with 1 to avoid division by zero
    TReal32 norm = 255.0 / max;
    TReal32 g = 0.0;
    for ( i = 0; i < imgSize; i++ )
        {
        edges[i] = 0; // initialise with 0 and only change if necessary
        if ( grad[i] > 0 )
            {
            g = static_cast<TReal32>( grad[i] );
            edges[i] = static_cast<TUint8>( g * norm );
            }
        }
    }

Listing A.1: Sobel operator and normalisation
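For reference, the two gradient computations in the listing above correspond to the standard 3×3 Sobel masks

$$G_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \qquad G_y = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix},$$

with the gradient magnitude approximated as $|G| \approx |G_x| + |G_y|$, as computed via Abs() in the code.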

Template Matching
TInt CMSPImgProcessor::MatchTemplate( CFbsBitmap* srcBitmap, TInt th )
    {
    TInt direction = -1;
    // load the bitmap from an .mbm file
    _LIT( KMBMFileName, "z:\\resource\\apps\\Templates.mbm" );
    // create a new bitmap for the templates and push
    // on the cleanup stack
    CFbsBitmap* atemplate = new ( ELeave ) CFbsBitmap();
    CleanupStack::PushL( atemplate );
    TInt imgSize = srcBitmap->SizeInPixels().iHeight
                   * srcBitmap->SizeInPixels().iWidth;
    // lock the global heap
    srcBitmap->LockHeap( ETrue );
    TUint8* src = (TUint8*) srcBitmap->DataAddress();
    for ( TInt i = 0; i < 8; i++ )
        {
        TInt sum = 0;
        // load the template
        User::LeaveIfError( atemplate->Load( KMBMFileName, i ) );
        TUint8* temp = (TUint8*) atemplate->DataAddress();
        for ( TInt j = 0; j < imgSize; j++ )
            {
            // compute difference between corresponding source and
            // template pixels (indexed by j, not the template counter i)
            sum += Abs( temp[j] - src[j] );
            }
        // if the difference for one template is less than the threshold
        if ( sum < th )
            {
            direction = i;
            break;
            }
        }
    srcBitmap->UnlockHeap( ETrue );
    CleanupStack::PopAndDestroy( atemplate );
    return direction;
    }

Listing A.2: Template matching to determine the arrow direction
