
OPTIMIZED FLAME DETECTION USING IMAGE PROCESSING BASED TECHNIQUES

Abstract

The present work is an in-depth study of detecting flames in video by processing the data captured by an
ordinary camera. Previous vision-based methods relied on color difference, motion detection of flame
pixels and flame edge detection. This paper focuses on optimizing flame detection by identifying the gray
cycle pixels near the flame, which are generated by smoke and by the spreading of fire pixels, and by
measuring the area spread of the flame. These techniques can be used together with existing fire detection
methods to reduce false alarms. The novel system simulates the existing fire detection techniques together
with the new techniques given above and provides an optimized way to detect fire with fewer false alarms
by giving accurate results on the occurrence of fire. The strength of using video in fire detection is the
ability to monitor large and open spaces. The novel system also gives the opportunity to tune the system
by applying different combinations of fire detection techniques, which helps in implementing the system
according to the requirements of different sensitive areas.

1. Introduction
Fire detection system sensors are used to detect the occurrence of fire and to make decisions
based on it. However, most of the available sensors, such as smoke detectors, flame
detectors and heat detectors, take time to respond [1]. They have to be carefully placed in various
locations, and they are not suitable for open spaces. Due to rapid developments in
digital camera technology and video processing techniques, conventional fire detection
methods are being replaced by computer vision based systems. Current vision based
techniques mainly rely on color clues, motion of fire pixels and edge detection of the flame.
The fire detection scheme can be made more robust by identifying the gray cycle pixels near the
flame and measuring flame area dispersion. For this project, we use the spectral
emissions of a forest fire to detect it. In order to properly design a system to detect fires,
the process by which energy is emitted from objects must first be examined. The basis of
spectral emission is Wien's law. This law states that the peak wavelength of light that an
object radiates, in meters, is a function of the temperature of that object in Kelvin. This peak
wavelength lets us select the band of the electromagnetic spectrum that we are interested in
examining. The spectral bands of interest for this project are the visible
range, the short-wave infrared band (SWIR), the mid-wave infrared band (MWIR) and the thermal
infrared band (LWIR). The values of the boundaries of these bands depend on the reference
material used. From Remote Sensing and Image Interpretation, the visible range is 350 to
750 nm, the SWIR band is 1.3 to 3 µm, the MWIR band from 3 to 5 µm, and the LWIR band
from 7.5 to 13 µm. The reason there is a gap between the MWIR band and the LWIR band is
atmospheric absorption. From 5 to 7 µm, the water in the atmosphere absorbs the
majority of transmitted radiation. This is one of the largest absorption bands in the
atmosphere. There are other bands that are absorbed, such as the ozone absorption band
around 9 µm, but the magnitude of the attenuation is small enough that those wavelengths are
still usable for remote sensing.
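
As a rough illustration of Wien's law, the short Matlab sketch below computes the peak emission wavelength for a few temperatures using the displacement constant b of approximately 2898 µm·K; the temperature values are illustrative, not measured values from this project.

    % Peak emission wavelength from Wien's displacement law: lambda_peak = b / T.
    b = 2898;                       % Wien's displacement constant in um*K
    T = [300 600 1100];             % illustrative temperatures: ambient, smoldering, flaming (K)
    lambda_peak = b ./ T;           % peak wavelengths in micrometers
    for k = 1:numel(T)
        fprintf('T = %4d K  ->  peak wavelength = %.2f um\n', T(k), lambda_peak(k));
    end

The ambient background peaks in the LWIR band (around 9 to 10 µm), whereas a flaming source peaks towards the SWIR/MWIR region, which is consistent with the band choices discussed above.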
Another important consideration is the minimum detectable feature size. This is based on the
size of the pixels on the camera, the focal length of the lens, and the distance from the camera
to the target of interest. The ratio that defines the minimum feature size is d = p x R / f, where p
is the pixel size, f is the focal length and R is the distance to the target. This formula
allows you to plug in the physical features of the system and determine the smallest object in
the area of the target that will show up as one pixel in the resulting image. This is important
to know, because knowing the size of the object in pixels and the size of each pixel allows
you to calculate the physical size of the object. The ability to calculate the size of an object
allows you to set size detection thresholds for image processing.
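
A minimal sketch of this calculation is given below; the pixel pitch, focal length and range are illustrative values, not the project's actual parameters.

    % Minimum detectable feature size: the ground area covered by one pixel.
    % Assumes the pinhole relation feature_size = pixel_pitch * range / focal_length.
    pixel_pitch  = 17e-6;    % detector pixel size in meters (illustrative)
    focal_length = 25e-3;    % lens focal length in meters (illustrative)
    range        = 2000;     % distance from the camera to the target in meters
    min_feature  = pixel_pitch * range / focal_length;   % smallest object covering one pixel (m)
    fprintf('One pixel covers %.2f m at a range of %.0f m\n', min_feature, range);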
Detector Types

Two types of detectors are used in this project: charge-coupled devices (CCDs) and
microbolometers. These two camera constructions achieve similar effective results, but work
by different physical processes. A CCD is composed of an array of capacitors, each
corresponding to a single pixel. As light strikes the array, the energy from the photons
separates electrons and holes. These electrons charge the capacitor. When the desired
exposure time has elapsed, the voltage charges on the capacitors are shifted along their rows,
read and digitized. The wavelengths that are absorbed by a particular CCD sensor depend on
the bandgap of the material that the sensor is made of. CCDs constructed of silicon are
sensitive from approximately 300 nm to 1 µm. Other materials have different bandgaps.
Mercury cadmium telluride (HgCdTe) is another example substrate; it is sensitive to the
thermal infrared band. Because CCDs made of silicon are sensitive to the visible range of
light, and silicon is already used for the majority of semiconductor devices, silicon CCDs are
used in all consumer digital cameras.

The other type of detector used is a microbolometer. A microbolometer differs from a
CCD in that it does not directly convert light energy into a voltage. This detector type is only
used for thermal imaging. The pixels of a microbolometer consist of an array of infrared
absorbing materials. When thermal energy hits a pixel, it is absorbed by the absorbing
material and the pixel changes temperature. As the temperature of the absorber changes, its
resistance changes as well. The resistance of the pixel is measured, and that resistance value
is converted into an equivalent intensity. Because more steps are required to translate
the energy intensity into a visible picture, microbolometers are less sensitive and less
accurate than CCD based systems.

Microbolometers are used because, for the thermal band, an HgCdTe-based CCD must be
cooled to cryogenic temperatures of approximately 77 K in order to reduce the thermal noise
of the system enough to acquire usable data. The machinery required to achieve this is
bulky, expensive and difficult to maintain. This makes CCD based infrared systems
too costly for most applications. In a controlled laboratory setting, they are used due to
their higher accuracy. In industrial, commercial and military applications, however,
microbolometers are used because they are less expensive and more robust. Mid-wave
infrared detectors must also be actively cooled, resulting in many of the same problems.

System Specifications

Many factors had to be considered when creating specifications for a cost-effective
fire detection system. One factor was detection range. For the system to be usable in
the field, each unit must be able to effectively provide coverage of a large enough area.
Another consideration is cost. The lower the cost of each individual unit, the more units can
be placed into operation, providing greater coverage of an area. The design specifications are
listed in Figure 2.

Initial Design Choices

In order to decide what camera to use, we had to examine the spectral emissions a fire emits.
By looking at the frequencies of maximum emission, the radiated energy is detectable at a
longer range than if we examined spectral bands with lower emitted energy. Ideally, we
would like to use the entire IR band from 1 µm to 15 µm. This is not possible because of the water
absorption gap, which means that any detector sensitive to these wavelengths would not
receive any light intensity in the 5-7 µm range, even if those frequencies are emitted
by the fire. The second reason is that there is no suitable detector that can receive a band of
frequencies that wide; the bandwidth of a detector is again related to the material the sensor
is made of. The first parameter we looked at is the spectral emission of a forest fire. The vast
majority of fire research in the literature describes satellite-based systems. According to (Sun, et
al., 2006), the optimal band for fire detection is 4.34-4.76 µm. Unfortunately this is in the
MWIR band, where detectors are expensive and must be actively cooled. This makes them
unsuitable for an application such as cost-effective fire detection, where the unit must be
placed out in the field with little maintenance or infrastructure for long periods of time.
Therefore, a combination of other bands must be used. The possible usable bands for fire
detection are those for which uncooled detectors are available. This means that the visible
light spectrum, the SWIR band and the LWIR band are potential candidates. Knowing the
available detector types, the cameras to be used can be chosen. Off-the-shelf consumer
cameras are inexpensive, but have a spectral response limited to the visible light spectrum.
Thermal cameras have sensitivity in the IR range of light, but are much more expensive. In
order to reliably detect a fire, we need to be able to see the fire in a wide variety of
conditions. The system has to be able to see through inclement weather effects and work at
night as well as by day. The first camera system we decided to test was a standard off-the-shelf
consumer camera. The model chosen was the Canon Powershot A95. Figure 3
contains the specifications for this camera.
2. SYSTEM ARCHITECTURE

The proposed system is a new video-based fire detection system that makes use of optical
flow features calculated from optical flow vectors created by different optical flow methods
for feature vector extraction, and then makes use of trained neural networks for feature vector
classification. The main highlight of this system is the optical flow vector creation, which is
used to estimate the amount of motion undergone by an object while moving from one frame
to another. The main merit is that instead of relying only on flame-based analysis for fire
detection, the system additionally makes use of smoke-based detection to find fire in
situations where the flame-based system may fail. The overall system consists mainly of two
halves: one for flame-based detection and the other for smoke-based detection.

The system consists of two main modules: one for identifying whether flame is present in
the frame and another for finding whether smoke is present in the frame. The flame
based method makes use of the OMT and NSD methods for optical flow vector creation. The
OMT method is successful for modeling fire with dynamic texture, while the NSD method is
used for modeling saturated fire blobs. In the smoke-based module, the pyramidal Lucas-Kanade [2]
optical flow method is used. Compared to the Horn-Schunck method, the Lucas-Kanade
method is better suited for modeling smoke, and the pyramidal Lucas-Kanade model handles
large object motions much more effectively than Lucas-Kanade without
pyramids.

3. MODULES
3.1 Flame Detection Module

The flame detection module outputs whether flame is present in the frame. The
module works by considering two consecutive frames of the video, and all processing is
done for each frame pair. The processing starts by converting the input RGB frames into
frames in the HSV color space. Then a generalized mass transformation, which works on a color
basis and is suitable for segmenting foreground from background, is applied to each
frame. Next, optical flow vectors are calculated for the image produced by the
generalized mass transformation. Then the low-motion pixels are eliminated, to avoid
processing overhead, by analyzing the magnitude of the flow vectors. After that, four features
are calculated by analyzing the flow vectors. Finally, a feed-forward neural network is
used for feature classification. In the testing case, the trained neural network gives the
probability of the presence of flame in the frame.

3.1.1 Selection of Consecutive Frames

The processing starts by taking consecutive frames of the video. For a frame F(t), the frame F(t+1) is
considered as the consecutive frame. Two frames are selected because the optical flow
vector is calculated between F(t) and F(t+1). The frames should be resized to 240 x 240
resolution.

3.1.2 RGB to HSV Transformation


The resized frames are converted into frames in the HSV color space. For that, the built-in function
rgb2hsv is used. H, S and V are hue, saturation and value, which represent the type,
purity and brightness of a color.
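
A minimal Matlab sketch of Sections 3.1.1 and 3.1.2 is given below; 'fire.avi' is a placeholder file name, and imresize assumes the Image Processing Toolbox is available.

    % Select two consecutive frames, resize them to 240 x 240 and convert to HSV.
    v       = VideoReader('fire.avi');            % placeholder video file
    prevRGB = readFrame(v);                       % frame F(t)
    currRGB = readFrame(v);                       % frame F(t+1), the consecutive frame
    prevHSV = rgb2hsv(imresize(prevRGB, [240 240]));
    currHSV = rgb2hsv(imresize(currRGB, [240 240]));
    H = currHSV(:,:,1);  S = currHSV(:,:,2);  V = currHSV(:,:,3);   % type, purity, brightness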

3.1.3 Color Based Transformation

The generalized mass of a pixel is represented by its similarity to the center fire color in the
HSV color space. The center fire color is a fully saturated and bright orange. Generalized
mass is based on flame color, which is suitable for segmenting foreground and background.
The generalized mass image is computed as a similarity value in [0, 1] for each pixel. In the
resulting color-transformed image, high values are generated for those colors in the fire color range.
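
The exact similarity measure is not spelled out above; the sketch below assumes a Gaussian fall-off from the center fire color (a fully saturated, bright orange in HSV), with the width sigma chosen arbitrarily, and it ignores hue wrap-around for simplicity.

    % Sketch of a color-based generalized mass transform on an HSV image.
    function m = colorTransform(hsvImg)
        centerFire = [30/360, 1.0, 1.0];     % assumed center fire color: saturated bright orange
        sigma      = 0.15;                   % assumed width of the similarity kernel
        d2 = (hsvImg(:,:,1) - centerFire(1)).^2 + ...
             (hsvImg(:,:,2) - centerFire(2)).^2 + ...
             (hsvImg(:,:,3) - centerFire(3)).^2;
        m  = exp(-d2 / (2*sigma^2));         % values in [0,1], high for fire-like colors
    end

Applied to the HSV frames from the previous sketch, currMass = colorTransform(currHSV) gives the color-transformed image used by the optical flow step.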

3.1.4 Optical Flow Vector Creation

Optical flow is a method used for estimating motion of objects across a series of frames. The
method is based on an assumption which states that points on the same object location
(therefore the corresponding pixel values) have constant brightness over time. Two methods
are being used for flow vector creation,

Optical Mass Transport (OMT) optical flow.

Non Smooth Data (NSD) optical flow.

Here, for the flow vector creation, first the average intensity image and the difference image are
found. For that, the Gaussian-smoothed color-transformed image is used, which is found by
convolving the image with a Gaussian kernel of size 7. Then the central sparse-matrix derivative
operators are found by convolving the mean image with kernels of size 7 that are specially
designed for finding the derivatives of the image along each direction. The OMT solution is
obtained from the average image, formed by taking the mean of the Gaussian-smoothed current
and previous images, and from b, the difference of the Gaussian-smoothed current and previous
images.

Here the regularization parameter is set to 0.4, and the Laplacian term is found by convolving the
mean image with a Laplacian-of-Gaussian kernel; the remaining terms are the image derivatives.
For fire motion the flow vectors created are non-smooth, while
for rigid motion of an object, smooth vectors are created.
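
The OMT and NSD solvers themselves are not reproduced here; the sketch below only shows the preprocessing just described (Gaussian smoothing with a size-7 kernel, the mean and difference images, simple derivative filters and the Laplacian-of-Gaussian term). fspecial assumes the Image Processing Toolbox, the Sobel kernels stand in for the size-7 derivative kernels, and the Gaussian and LoG widths are assumed values.

    prevMass = rand(240);  currMass = rand(240);        % placeholders for color-transformed frames
    g      = fspecial('gaussian', 7, 1.0);              % 7x7 Gaussian kernel (sigma assumed)
    prevS  = conv2(prevMass, g, 'same');                % Gaussian-smoothed previous frame
    currS  = conv2(currMass, g, 'same');                % Gaussian-smoothed current frame
    meanI  = (prevS + currS) / 2;                       % average intensity image
    b      = currS - prevS;                             % difference image
    Ix     = conv2(meanI, fspecial('sobel')', 'same');  % derivative along x (stand-in kernel)
    Iy     = conv2(meanI, fspecial('sobel'),  'same');  % derivative along y (stand-in kernel)
    lapI   = conv2(meanI, fspecial('log', 7, 0.5), 'same');  % Laplacian-of-Gaussian of the mean image
    alpha  = 0.4;                                       % regularization weight from the text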

3.1.5 Rejection of Non Essential Pixels

To avoid unnecessary computation, non-essential pixels have to be eliminated by analyzing
the magnitude of the flow vectors that are created. For that, first find the norm of the
flow vector at each pixel position and then find the maximum value among them. Then take
twenty percent of that maximum value. If the norm of the flow vector at a pixel
position is greater than this resulting value, that pixel is considered an essential
one. Non-essential pixel elimination should be done for both the OMT and the NSD method.
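
A sketch of this rejection step, using placeholder flow components u and v:

    u = randn(240);  v = randn(240);            % placeholder OMT (or NSD) flow components
    flowNorm  = hypot(u, v);                    % norm of the flow vector at every pixel
    threshold = 0.2 * max(flowNorm(:));         % twenty percent of the maximum norm
    essential = flowNorm > threshold;           % logical mask of essential pixels
    fprintf('%d of %d pixels kept as essential\n', nnz(essential), numel(essential));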

3.1.6 Feature Extraction

Four features are extracted by analyzing the magnitude and direction of the flow
vectors that are created. In this stage, only the essential pixels remaining after the
non-essential pixel elimination are considered.

OMT Transport Energy

Fire and other objects in the fire color spectrum will produce a high value for this feature.
This feature measures the mean of the transport energy per pixel in the subregion.

NSD Flow Magnitude

This value will be high for fire-colored objects. The NSD flow magnitude is calculated by
taking the mean of half of the square of the norm of the NSD flow vectors calculated at each
pixel position.
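
Two of these features are sketched below over the essential pixels only. The exact transport-energy expression is not given in the text, so a mass-weighted squared flow norm is assumed for the OMT feature; all inputs are placeholders.

    uOMT = randn(240); vOMT = randn(240);                   % placeholder OMT flow components
    uNSD = randn(240); vNSD = randn(240);                   % placeholder NSD flow components
    currMass  = rand(240);                                  % placeholder generalized mass image
    essential = hypot(uOMT, vOMT) > 0.2 * max(hypot(uOMT(:), vOMT(:)));   % mask from Section 3.1.5

    mass = currMass(essential);
    omtEnergy    = mean(mass .* (uOMT(essential).^2 + vOMT(essential).^2));  % assumed transport energy per pixel
    nsdMagnitude = mean(0.5 * (uNSD(essential).^2 + vNSD(essential).^2));    % mean of half the squared NSD norm
    featureVector = [omtEnergy; nsdMagnitude];              % two of the four features fed to the network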

1.1 OVERVIEW
Paper has been the traditional medium for printed documents. However, with the
advancement of digital technology, paper documents are gradually being augmented by
electronic documents. Paper documents consist of printed information on paper media.
Electronic documents use predefined digital formats, where information about both textual
and graphical document elements is recorded along with layout and stylistic data. Both
paper and electronic documents have their own advantages and disadvantages for the user.
For example, information on paper is easy to access but tedious to modify and difficult
to store in large volumes, while electronic documents are best for the storage of huge
databases but are difficult to modify.
In order to gain the benefits of both media, the user needs to be able to port information freely
between the two formats. Due to this need, the development of computer systems capable of
accomplishing this interconversion is needed. Therefore, Automatic Document Conversion has
become increasingly important in many areas of academia, business and industry. Automatic
Document Conversion occurs in two directions: Document Formatting and Document Image
Analysis. The first automatically converts electronic documents to paper documents, and the
second converts paper documents to their electronic counterparts.
Document Image Analysis is concerned with the problem of transferring the document
images into electronic format. This would involve the automatic interpretation of text images in a
printed document, such as books, reference papers, newspapers etc. Document Image Analysis
can be defined as the process that performs the overall interpretation of document images. It is a
key area of research for various applications in machine vision and media processing, including
page readers, content-based document retrieval, digital libraries etc.
There is a considerable amount of text occurring in video that is a useful source of
information, which can be used to improve the indexing of video. The presence of text in a
scene, to some extent, naturally describes its content. If this text information can be harnessed, it
can be used along with the temporal segmentation methods to provide a much truer form of
content-based access to the video data.

Figure 1.1 Example of a documented video image clip


Text detection and recognition in videos can help a lot in video content analysis and
understanding, since text can provide a concise and direct description of the stories presented in
the videos. In digital news videos, the superimposed captions usually present the involved
person's name and a summary of the news event. Hence, the recognized text can become part
of the index in a video retrieval system.
1.2 STATEMENT OF PROBLEM
Text in images and video sequences provides highly condensed information about the contents of
the images or video sequences and can be used for video browsing in a large video
database. Text superimposed on video frames provides supplemental but important
information for video indexing and retrieval. Although text provides important information about
images or video sequences, it is not an easy problem to detect and segment it. The main
difficulties lie in the low resolution of the text and the complexity of the background. Video
frames have very low resolution and suffer from blurring effects due to lossy compression.
Additionally, the background of a video frame is more complex, with many objects having
text-like features. One more problem lies with the handling of large amounts of text data in video
clip images.
1.3 OBJECTIVE OF THE STUDY
The main objective of this project is to develop an efficient text extraction system for the
localization of text data in video image sequences. The project also aims at recognizing the
extracted text data and making it editable for further modification. The implemented project
analyzes existing wavelet transforms for their suitability for isolating text with multiple
features. The project applies morphological operations to the wavelet coefficients and presents
an efficient approach to the recognition of text characters from the isolated documented video
image, making it editable for further modification.
1.4 REVIEW OF LITERATURE
Many efforts have been made for text extraction and recognition in video image
sequence. Chung-Wei Liang and Po-Yueh Chen [1] in their paper DWT Based Text
Localization present an efficient and simple method to extract text regions from static images or
video sequences. They implemented the Haar Discrete Wavelet Transform (DWT) with
morphological operators to detect the edges of candidate text regions for the isolation of text data
from the documented video image.
A Video Text Detection and Recognition System presented by Jie Xi, Xian-Sheng Hua,
Xiang-Rong Chen, Liu Wenyin and Hong-Jiang Zhang [2] proposed a new system for text
information extraction from news videos. They developed a method for text detection and text
tracking to locate text areas in the key-frames. Xian-Sheng Hua, Pei Yin and Hong-Jiang Zhang, in
their paper Efficient Video Text Recognition Using Multiple Frame Integration [3], presented an
efficient scheme to deal with multiple frames that contain the same text in order to get clear words
from isolated frames.
Celine Thillou and Bernard Gosselin proposed a thresholding method for degraded
documents acquired from a low-resolution camera [4]. They use a technique based on wavelet
denoising and global thresholding for non-uniform illumination. In their paper Segmentation-
based Binarization for Color-degraded Images [5] they described the stroke analysis and character
segmentation used for text segmentation. They proposed a binarization method to improve character
segmentation and recognition.
S. Antani and D. Crandall in their paper Robust Extraction of Text in Video [7] describe
an update to the prototype system for detection, localization and extraction of text from
documented video images. Rainer Lienhart and Frank Stuber presented an algorithm for
automatic character segmentation for motion pictures in their paper Automatic Text Recognition
in Digital Videos [9], which extracts automatically and reliably the text in pre-title sequences,
credit titles, and closing sequences with title and credits. The algorithm uses a typical
characteristic of text in videos in order to enhance segmentation and recognition.
Jovanka Malobabi, Noel O'Connor, Noel Murphy and Sean Marlow, in their
paper Automatic Detection and Extraction of Artificial Text in Video [12], proposed an algorithm
for detection and localization of artificial text in video images using a horizontal difference
magnitude measure and morphological processing.
1.5 SCOPE OF STUDY
This project implements an efficient system for the extraction of text from given
documented video clips and recognizes the extracted text data for further applications. The
implemented work finds efficient usage in video image processing for enhancement
and maintenance. The work can be efficiently used in the area of video image enhancement, such
as cinematography and video presentation. The proposed work will also be very useful for
digital library maintenance of video databases.
Following are the areas of application of text isolation and recognition in video images:
1. Digital library: for maintenance of documented video images in large databases.
2. Data modification: useful for modification of information in video images.
3. Cinematographic applications: for enhancing the document information in movie video
clips.
4. Instant documentation of news and reports: for documentation of instant reports and
news matters on paper.
1.6 METHODOLOGY
Many efforts have been made earlier to address the problems of text area detection, text
segmentation and text recognition. Current text detection approaches can be classified into three
categories:
The first category is the connected component-based method, which can locate text quickly
but has difficulties when the text is embedded in a complex background or touches other graphical
objects.
The second category is texture-based, for which it is hard to find accurate boundaries of text
areas and which usually yields many false alarms in text-like background texture areas.
The third category is the edge-based method. Generally, analyzing the projection profiles of edge
intensity maps can decompose text regions and can efficiently predict the text data from a given
video image clip.
Text regions usually have a special texture because they consist of identical character
components. These components contrast with the background and have a periodic horizontal intensity
variation due to the horizontal alignment of many characters. As a result, text regions can be
segmented using texture features.
1.6.1 DOCUMENT IMAGE SEGMENTATION
Document Image Segmentation is the act of partitioning a document image into separated
regions. These regions should ideally correspond to the image entities such as text blocks and
graphical images, which are present in the document image. These entities can then be identified
and processed as required by the subsequent steps of Automated Document Conversion.
Various methods are described for processing Document Image Segmentation. They
include: Layout Analysis, Geometric Structure Detection/Analysis, Document Analysis,
Document Page Decomposition, Layout Segmentation, etc. Texts in images and video sequences
provide highly condensed information about the contents of the images or video sequences and
can be used for video browsing/retrieval in a large image database. Although texts provide
important information about images or video sequences, it is not easy to detect and segment out
the text data from the documented image.
The difficulty in text extraction is due to the following reasons;
1. The text properties vary randomly with non-uniform distribution.
2. Texts present in an image or a video sequence may have different cluttered background.
Text extraction methods can be component-based or texture-based. In
component-based text extraction methods, text regions are detected by analyzing the edge
components of the candidate regions or the homogeneous color/grayscale components that contain the
characters. Texture-based methods, on the other hand, use texture properties such as the curviness of the
characters for text isolation. In texture-based document image analysis an M-band
wavelet transformation is used, which decomposes the image into M x M band-pass sub-channels
so that the text regions can be detected easily from the documented image. The intensity of the candidate
text edges is used to recognize the real text regions in an M-sub-band image.
1.6.2 WAVELET TRANSFORMATION
Digital image is represented as a two-dimensional array of coefficients, each coefficient
representing the intensity level at that coordinate. Most natural images have smooth color
variations, with the fine details being represented as sharp edges in between the smooth
variations. Technically, the smooth variations in color can be termed as low frequency variations,
and the sharp variations as high frequency variations.
The low frequency components (smooth variations) constitute the base of an image, and
the high frequency components (the edges which give the details) add upon them to refine the
image, thereby giving a detailed image. Hence, the smooth variations are more important than
the details.
Separating the smooth variations and details of the image can be performed in many
ways. One way is the decomposition of the image using the discrete wavelet transform. Digital
image compression is based on the ideas of sub-band decomposition or discrete wavelet
transforms. Wavelets, which refer to a set of basis functions, are defined recursively from a set of
scaling coefficients and scaling functions. The DWT is defined using these scaling functions and
can be used to analyze digital images with performance superior to that of classical short-time
Fourier-based techniques, such as the DCT.
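
As a sketch of how such a decomposition can flag candidate text edges, a single-level Haar DWT is shown below (dwt2 assumes the Wavelet Toolbox; the file name and the threshold are illustrative):

    I = double(imread('frame.png')) / 255;         % placeholder video frame, scaled to [0,1]
    if size(I,3) == 3, I = mean(I, 3); end         % simple gray-scale conversion
    [LL, LH, HL, HH] = dwt2(I, 'haar');            % approximation and detail sub-bands
    edgeMap   = sqrt(LH.^2 + HL.^2 + HH.^2);       % combine the three detail bands
    candidate = edgeMap > 0.1 * max(edgeMap(:));   % assumed threshold for candidate text edges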
1.6.3 MORPHOLOGICAL OPERATION
Mathematical morphology is a tool for extracting image components that are useful in the
representation and description of region shape, such as boundaries, skeletons and the convex
hull. It defines two fundamental morphological operations, dilation and erosion, in terms of
the union or intersection of an image with a translated shape called a structuring element.
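
A sketch of the two operations applied to a candidate edge map like the one from the previous sketch (imdilate/imerode assume the Image Processing Toolbox; the structuring element size is illustrative):

    candidate = rand(120, 160) > 0.9;    % placeholder binary map of candidate text edges
    se  = strel('rectangle', [3 7]);     % wide structuring element to link characters on a text line
    dil = imdilate(candidate, se);       % dilation: grows regions and merges nearby character edges
    ero = imerode(dil, se);              % erosion: shrinks regions and removes thin false responses
    textMask = ero;                      % dilation followed by erosion acts as a morphological closing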
1.6.4 CHARACTER RECOGNITION
The essential problem of character recognition is to identify an object as belonging to a
particular group. Assuming that the objects associated with a particular group share common
attributes more than with objects in other groups, the problem of assigning an unlabeled object to
a group can be accomplished by determining the attributes of the object, called features. If
information about all possible objects and the groups to which they are assigned is known, then
the identification problem is straightforward, i.e., the attributes that best discriminate among
groups and the mapping from attributes to groups can be determined with certainty.
Given the goal of classifying objects based on their attributes, the functionality of an automated
character recognition system can be divided into two basic tasks:
a) The description task generates attributes of an object using feature extraction techniques.
b) The classification task assigns a group label to the object based on those attributes with a
classifier.
The description and classification tasks work together to determine the most accurate label
for each unlabeled object analyzed by the character recognition system. This is accomplished
with a training phase that configures the algorithms used in both the description and
classification tasks based on a collection of objects whose labels are known--i.e., the training set.
During the training phase, a training set is analyzed to determine the attributes and mapping
which assigns labels to the objects in the training set with the fewest errors. Once trained, a
character recognition system assigns a classification to an unlabeled object by applying the
mapping to the attributes of that object. A measure of the efficacy of a trained character
recognition system can be computed by comparing the known labels with the labels assigned by
the classification task to the training set: as the agreement between known and assigned labels
increases, the accuracy of the character recognition system increases. Such a methodology for
configuring and evaluating the description and classification tasks of a character recognition
system is called supervised learning.
1.7 LIMITATION OF STUDY
This project work implements a text isolation and recognition system for the isolation of
text characters from a given video sequence. The implementation has certain limitations.
The implemented system gives lower accuracy for video images with high-intensity
backgrounds. The implementation also shows lower accuracy in the extraction and recognition of
text under occlusion. For video sequences with highly variable components, the
system produces text isolation with noise.
Features Extracted From Gray Scale Images

A major challenge in gray scale image-based methods is to locate candidate character locations.
One can use a locally adaptive binarization method to obtain a good binary raster image, and use
connected components of the expected character size to locate the candidate characters.
However, a gray scale-based method is typically used when recognition based on the binary
raster representation fails, so the localization problem remains unsolved for difficult images. One
may have to resort to the brute force approach of trying all possible locations in the image.
However, one then has to assume a standard size for a character image, as the combination of all
character sizes and locations is computationally prohibitive. This approach cannot be used if the
character size is expected to vary. The desired result of the localization or segmentation step is a
subimage containing one character, and, except for background pixels, no other objects.
However, when print objects appear very close to each other in the input image, this goal cannot
always be achieved. Often, other characters or print objects may accidentally occur inside the
subimage (Fig. 3), possibly distorting the extracted features. This is one of the reasons why every
character recognition system has a reject option.

2.1 Template matching

We are not aware of OCR systems using template matching on gray scale character images.
However, since template matching is a fairly standard image processing technique [32, 33], we
have included this section for completeness. In template matching the feature extraction step is
left out altogether, and the character image itself is used as a "feature vector". In the recognition
stage, a similarity (or dissimilarity) measure between each template Tj and the character image Z
is computed. E_Z and E_Tj are the total character image energy and the total template energy,
respectively. R_ZTj is the cross-correlation between the character and the template, and could
have been used as a similarity measure, but Pratt [33] points out that R_ZTj may detect a false
match if, say, Z contains mostly high values. In that case, E_Z also has a high value, and it could
be used to normalize R_ZTj by the expression R~_ZTj = R_ZTj / E_Z. However, in Pratt's
formulation of template matching, one wants to decide whether the template is present in the
image (and get the locations of each occurrence). Our problem is the opposite: find the template
that matches the character image best. Therefore, it is more relevant to normalize the cross-
correlation by dividing it by the total template energy: R^_ZTj = R_ZTj / E_Tj. Experiments are
needed to decide whether Dj or R^_ZTj should be used for OCR. Although simple, template
matching suffers from some obvious limitations. One template is only capable of recognizing
characters of the same size and rotation, is not illumination-invariant (invariant to contrast and to
mean gray level), and is very vulnerable to noise and small variations that occur among characters
from the same class. However, many templates may be used for each character class, but at the
cost of higher computational time, since every input character has to be compared with every
template. The character candidates in the input image can be scaled to suit the template sizes, thus
making the recognizer scale independent.
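
A minimal sketch of the template-energy normalization discussed above, with placeholder images of equal size:

    Z = rand(32, 24);                  % candidate character image (placeholder)
    T = rand(32, 24);                  % one class template (placeholder)
    R_ZT  = sum(sum(Z .* T));          % cross-correlation between character and template
    E_T   = sum(sum(T .^ 2));          % total template energy
    score = R_ZT / E_T;                % template-energy-normalised similarity used to rank templates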

2.2 Deformable Templates

Deformable templates have been used extensively in several object recognition applications [34,
35]. Recently, Del Bimbo et al. [36] proposed to use deformable templates for character
recognition in gray scale images of credit card slips with poor print quality. The templates used
were character skeletons. It is not clear how the initial positions of the templates were chosen. If
all possible positions in the image were to be tried, then the computational time would be
prohibitive.

2.3 Unitary Image Transforms

In template matching, all the pixels in the gray scale character image are used as features.
Andrews [37] applies a unitary transform to character images, obtaining a reduction in the
number of features while preserving most of the information about the character shape. In the
transformed space, the pixels are ordered by their variance, and the pixels with the highest
variance are used as features. The unitary transform has to be applied to a training set to obtain
estimates of the variances of the pixels in the transformed space. Andrews investigated the
Karhunen-Loeve (KL), Fourier, Hadamard (or Walsh), and Haar transforms in 1971 [37]. He
concluded that the KL transform was too computationally demanding, so he recommended to use
the Fourier or Hadamard transforms. However, the KL transform is the only (mean-squared
error) optimal unitary transform in terms of information compression [38]. When the KL
transform is used, the same amount of information about the input character image is contained
in fewer features compared to any other unitary transform. Other unitary transforms include the
Cosine, Sine, and Slant transforms [38]. It has been shown that the Cosine transform is better in
terms of information compression (e.g., see pp. 375-379 in [38]) than the other non-optimal
unitary transforms. Its computational cost is comparable to that of the fast Fourier transform, so
the Cosine transform has been coined "the method of choice for image data compression" [38].
The KL transform has been used for object recognition in several application domains, for
example face recognition [39]. It is also a realistic alternative for OCR on gray level images with
today's fast computers. The features extracted from unitary transforms are not rotation-invariant,
so the input character images have to be rotated to a standard orientation if rotated characters
may occur. Further, the input images have to be of exactly the same size, so a scaling or re-
sampling is necessary if the size can vary. The unitary transforms are not illumination invariant,
but for the Fourier transformed image the value at the origin is proportional to the average pixel
value of the input image, so this feature can be deleted to obtain brightness invariance. For all
unitary transforms, an inverse transform exists, so the original character image can be
reconstructed.

2.4 Zoning

The commercial OCR system by Calera described in Bokser [40] uses zoning on solid binary
characters. A straightforward generalization of this method to gray level character images is
given here. An n x m grid is superimposed on the character image (Fig. 8(a)), and for each of the
n x m zones, the average gray level is computed (Fig. 8(b)), giving a feature vector of length n x m.
However, these features are not illumination invariant.
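
A sketch of gray-level zoning with an illustrative 4 x 3 grid (the image is assumed to divide evenly into the grid):

    img = rand(32, 24);                % placeholder gray-level character image, values in [0,1]
    n = 4;  m = 3;                     % grid dimensions
    [rows, cols] = size(img);
    zones = zeros(n, m);
    for i = 1:n
        for j = 1:m
            block = img((i-1)*rows/n+1 : i*rows/n, (j-1)*cols/m+1 : j*cols/m);
            zones(i, j) = mean(block(:));    % average gray level of zone (i, j)
        end
    end
    featureVector = zones(:);          % feature vector of length n*m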

2.5 Geometric Moment Invariants

Hu [41] introduced the use of moment invariants as features for pattern recognition. Hu's
absolute orthogonal moment invariants (invariant to translation, scale and rotation) have been
extensively used.
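
For illustration, the first two of Hu's seven invariants can be computed from normalised central moments as sketched below (placeholder image; the remaining five invariants follow the same pattern):

    img = rand(32, 24);                                    % placeholder gray-level character image
    [X, Y] = meshgrid(1:size(img,2), 1:size(img,1));       % pixel coordinate grids
    m00 = sum(img(:));                                     % zeroth-order moment
    xc  = sum(sum(X .* img)) / m00;                        % centroid coordinates
    yc  = sum(sum(Y .* img)) / m00;
    mu  = @(p, q) sum(sum(((X - xc).^p) .* ((Y - yc).^q) .* img));   % central moments
    eta = @(p, q) mu(p, q) / m00^((p + q)/2 + 1);                    % normalised central moments
    phi1 = eta(2,0) + eta(0,2);                            % first Hu invariant
    phi2 = (eta(2,0) - eta(0,2))^2 + 4*eta(1,1)^2;         % second Hu invariant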
Representation

In this chapter we discuss the representation of images, covering basic notation and information
about images together with a discussion of standard image types and image formats. We end
with a practical section, introducing Matlab's facilities for reading, writing, querying, converting
and displaying images of different image types and formats.

1.1 What is an image?

A digital image can be considered as a discrete representation of data possessing both spatial
(layout) and intensity (colour) information. As we shall see in Chapter 5, we can also consider
treating an image as a multidimensional signal.

1.1.1 Image layout

The two-dimensional (2-D) discrete, digital image I(m, n) represents the response of some
sensor (or simply a value of some interest) at a series of fixed positions (m = 1, 2, ..., M;
n = 1, 2, ..., N) in 2-D Cartesian coordinates and is derived from the 2-D continuous spatial signal
I(x, y) through a sampling process frequently referred to as discretization. Discretization occurs
naturally with certain types of imaging sensor (such as CCD cameras) and basically effects a
local averaging of the continuous signal over some small (typically square) region in the
receiving domain. The indices m and n respectively designate the rows and columns of the
image. The individual picture elements or pixels of the image are thus referred to by their 2-D
(m, n) index. Following the Matlab convention, I(m, n) denotes the response of the pixel
located at the mth row and nth column starting from a top-left image origin (see Figure 1.1). In
other imaging systems, a column-row convention may be used and the image origin in use may
also vary. Although the images we consider in this book will be discrete, it is often theoretically
convenient to treat an image as a continuous spatial signal: I(x, y). In particular, this sometimes
allows us to make more natural use of the powerful techniques of integral and differential
calculus to understand properties of images and to effectively manipulate and
process them. Mathematical analysis of discrete images generally leads to a linear algebraic
formulation, which is better in some instances. The individual pixel values in most images do
actually correspond to some physical
response in real 2-D space (e.g. the optical intensity received at the image plane of a camera or
the ultrasound intensity at a transceiver). However, we are also free to consider images in
abstract spaces where the coordinates correspond to something other than physical space and we
may also extend the notion of an image to three or more dimensions. For example, medical
imaging applications sometimes consider full three-dimensional (3-D) reconstruction of internal
organs and a time sequence of such images (such as a beating heart) can be treated (if we wish)
as a single four-dimensional (4-D) image in which three coordinates are spatial and the other
corresponds to time. When we consider 3-D imaging we are often discussing spatial volumes
represented by the image. In this instance, such 3-D pixels are denoted as voxels (volumetric
pixels) representing the smallest spatial location in the 3-D volume as opposed to the
conventional 2-D image. Throughout this book we will usually consider 2-D digital images, but
much of our discussion will be relevant to images in higher dimensions.

1.1.2 Image colour


An image contains one or more colour channels that define the intensity or colour at a particular
pixel location I(m, n). In the simplest case, each pixel location only contains a single numerical
value representing the signal level at that point in the image. The conversion from this set of
numbers to an actual (displayed) image is achieved through a colour map. A colour map assigns
a specific shade of colour to each numerical level in the image to give a visual representation of
the data. The most
common colour map is the greyscale, which assigns all shades of grey from black (zero) to white
(maximum) according to the signal level. The greyscale is particularly well suited to intensity
images, namely images which express only the intensity of the signal as a single value at each
point in the region. In certain instances, it can be better to display intensity images using a false-
colour map. One of the main motives behind the use of false-colour display rests on the fact that
the human visual system is only sensitive to approximately 40 shades of grey in the range from
black to white, whereas our sensitivity to colour is much finer. False colour can also serve to
accentuate or delineate certain features or structures, making them easier to identify for the
human observer. This approach is often taken in medical and astronomical images. Figure 1.2
shows an astronomical intensity image displayed using both greyscale and a particular false-
colour map. In this example the jet colour map (as defined in Matlab) has been used to highlight
the structure and finer detail of the image to the human viewer using a linear colour scale ranging
from dark blue (low intensity values) to dark red (high intensity values). The definition of colour
maps, i.e. assigning colours to numerical values, can be done in any way which the user finds
meaningful or useful. Although the mapping between the numerical intensity value and the
colour or greyscale shade is typically linear, there are situations in which a nonlinear mapping
between them is more appropriate. Such nonlinear mappings are discussed in Chapter 4. In
addition to greyscale images where we have a single numerical value at each pixel location, we
also have true colour images where the full spectrum of colours can be represented as a triplet
vector, typically the (R,G,B) components at each pixel location. Here, the colour is represented
as a linear combination of the basis colours or values and the image may be considered as
consisting of three 2-D planes. Other representations of colour are also possible and used quite
widely, such as the (H,S,V) (hue, saturation and value (or intensity)). In this representation, the
intensity V of the colour is decoupled from the chromatic information, which is contained within
the H and S components (see Section 1.4.2).

1.2 Resolution and quantization

The size of the 2-D pixel grid together with the data size stored for each individual image pixel
determines the spatial resolution and colour quantization of the image.

The representational power (or size) of an image is defined by its resolution. The resolution of an
image source (e.g. a camera) can be specified in terms of three quantities:
Spatial resolution: The column (C) by row (R) dimensions of the image define the number of
pixels used to cover the visual space captured by the image. This relates to the sampling of the
image signal and is sometimes referred to as the pixel or digital resolution of the image. It is
commonly quoted as C x R (e.g. 640 x 480, 800 x 600, 1024 x 768, etc.).
Temporal resolution: For a continuous capture system such as video, this is the number of
images captured in a given time period. It is commonly quoted in frames per second (fps), where
each individual image is referred to as a video frame (e.g. commonly broadcast TV operates at
25 fps; 25-30 fps is suitable for most visual surveillance; higher frame-rate cameras are available
for specialist science/engineering capture).
Bit resolution: This defines the number of possible intensity/colour values that a pixel may have
and relates to the quantization of the image information. For instance, a binary image has just two
colours (black or white), a grey-scale image commonly has 256 different grey levels ranging
from black to white, whilst for a colour image it depends on the colour range in use. The bit
resolution is commonly quoted as the number of binary bits required for storage at a given
quantization level, e.g. binary is 1 bit, grey-scale is 8 bit and colour (most commonly) is 24 bit.
The range of values a pixel may take is often referred to as the dynamic range of an image.

It is important to recognize that the bit resolution of an image does not necessarily correspond to
the resolution of the originating imaging system. A common feature of many cameras is
automatic gain, in which the minimum and maximum responses over the image field are sensed
and this range is automatically divided into a convenient number of bits (i.e. digitized into N
levels). In such a case, the bit resolution of the image is typically less than that which is, in
principle, achievable by the device. By contrast, the blind, unadjusted conversion of an analog
signal into a given number of bits, for instance 2^16 = 65 536 discrete levels, does not, of course,
imply that the true resolution of the imaging device as a whole is actually 16 bits. This is because
the overall level of noise (i.e. random fluctuation) in the system limits the number of levels that
are truly meaningful.

B - Optical Flow Method

Another possible way to detect moving objects is to investigate the optical flow, an
approximation of the two-dimensional flow field derived from the image intensities, which is
computed by extracting a dense velocity field from an image sequence. The optical flow field in
the image is calculated on the basis of two assumptions: that the intensity of any object point is
constant over time, and that nearby points in the image plane move in a similar way [1].
Additionally, the easiest method of finding image displacements with optical flow is the
feature-based optical flow approach, which finds features (for example, image edges, corners, and
other structures well localized in two dimensions) and tracks these as they move from frame to
frame. The feature-based optical flow method involves two stages. Firstly, the features are found
in two or more consecutive images. The act of feature extraction, if done well, will both reduce
the amount of information to be processed (and so reduce the workload) and go some way towards
obtaining a higher level of understanding of the scene, by its very nature of eliminating the
unimportant parts. Secondly, these features are matched between the frames. In the simplest and
commonest case, two frames are used and two sets of features are matched to give a single set of
motion vectors [5]. Additionally, finding optic flow using edges has the advantage (over using
two-dimensional features) that edge detection theory is well advanced, and it has the advantage
over approaches that attempt to find flow everywhere in the image. The features are found
according to the following feature selection algorithm:
1. Compute the spatial gradient matrix and its minimum eigenvalue at every pixel in the image I.
2. Take the maximum of these minimum eigenvalues over the whole image.
3. Retain the image pixels that have an eigenvalue larger than a percentage of this maximum;
this percentage can be 10% or 5%.
4. From those pixels, retain the local maximum pixels (a pixel is kept if its eigenvalue is larger
than that of any other pixel in its 3 x 3 neighborhood).
5. Keep the subset of those pixels so that the minimum distance
between any pair of pixels is larger than a given threshold distance (e.g. 10 or 5 pixels). [4]
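
A sketch of this feature-selection algorithm (minimum eigenvalue of the spatial gradient matrix), with imdilate assuming the Image Processing Toolbox; the file name is a placeholder and the minimum-distance pruning of step 5 is omitted:

    I = double(imread('frame.png')) / 255;               % placeholder gray-level frame
    if size(I,3) == 3, I = mean(I, 3); end               % simple gray-scale conversion
    Ix = conv2(I, [-1 0 1]/2,  'same');                  % spatial gradients
    Iy = conv2(I, [-1 0 1]'/2, 'same');
    w  = ones(3);                                        % summation window for the gradient matrix
    Sxx = conv2(Ix.^2, w, 'same');  Syy = conv2(Iy.^2, w, 'same');  Sxy = conv2(Ix.*Iy, w, 'same');
    lambdaMin = (Sxx + Syy)/2 - sqrt(((Sxx - Syy)/2).^2 + Sxy.^2);   % min eigenvalue (step 1)
    strong    = lambdaMin > 0.10 * max(lambdaMin(:));                % steps 2-3: 10% of the maximum
    localMax  = lambdaMin == imdilate(lambdaMin, ones(3));           % step 4: 3 x 3 local maxima
    features  = strong & localMax;                                   % selected feature points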

1. Computation of Optical Flow:

The idea of optical flow is to calculate a velocity vector v = (u, v) for each pixel in an image.
The function v(u, v) describes how quickly each particular pixel is moving across the image
stream, along with the direction in which the pixel is moving. Consider an image stream
described in terms of intensity as I(x, y, t). The brightness constancy assumption states that the
intensity of a point does not change as its position changes over time, I(x, y, t) = I(x + dx, y + dy, t + dt),
which after a first-order expansion gives the optical flow constraint Ix*u + Iy*v + It = 0.
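
A minimal sketch of solving this constraint in the least-squares sense over a small window around one point (placeholder frames; in practice the frames come from the video and the point from the feature selection above):

    I1 = rand(64);  I2 = rand(64);                   % two consecutive gray frames (placeholders)
    Ix = conv2(I1, [-1 0 1]/2,  'same');             % spatial derivatives
    Iy = conv2(I1, [-1 0 1]'/2, 'same');
    It = I2 - I1;                                    % temporal derivative
    r = 30;  c = 30;  w = 2;                         % feature location and half-window size
    wx = Ix(r-w:r+w, c-w:c+w);  wy = Iy(r-w:r+w, c-w:c+w);  wt = It(r-w:r+w, c-w:c+w);
    A  = [wx(:) wy(:)];                              % stack Ix*u + Iy*v = -It over the window
    uv = A \ (-wt(:));                               % least-squares flow [u; v] at (r, c)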

5. Fire Detection

Every year, thousands of people die in home fires. There are many reasons for these
fires, such as short circuits in electrical wiring or children playing with matches. Fire can easily
grow in room conditions because there are many flammable objects in homes, such as
carpets, curtains, wooden chairs and tables. To reduce damage, the fire has to be extinguished
as soon as possible. In our project, to protect the target person, we
developed a fire detection system based on video processing. When fire is detected, an alarm
sound begins to play at high volume. Through this alarm sound, if there is a person in a nearby
room, he or she can protect the target person.

We designed our fire detection system based on the Flame Recognition in Video method [7]. In
this method, color and motion information are computed from video sequences to detect fire.
According to the RGB color information of the pixels, fire-colored pixels are detected; fire-colored
pixels are possible fire pixels. To confirm the presence of fire, the temporal variations of the
fire-colored pixels are calculated. If the temporal variation is above some level, fire is detected.
Our fire detection system contains three main parts:
1 - Finding fire-colored pixels (possible fire pixels)
2 - Controlling the temporal variations of fire-colored pixels

3 - Detecting fire according to the temporal variations

A - Detection of Fire Colored Pixels:

To find possible fire pixels, we first find fire-colored pixels according to the RGB values of the
video frames. We used the following RGB threshold ranges to detect fire [9]:

R > 220, G > 200, 125 < B
R > 220, 125 < G < B < 220
175 < G < B

Most fire-colored pixels fall in these three ranges, so if a pixel's RGB values are in these ranges, it
is a fire-colored pixel. Fire is translucent by nature, and this transparency makes fire difficult to
detect. For this reason, we average the fire color estimate over small windows of time. Simply, we
find the fire-colored pixels for each frame and process the last n frames to decide the real
fire-colored pixels. To calculate the fire color probability (colorprob), we calculate the average
colorlookup value over the last n frames. The colorlookup value is 1 if the pixel values are in the
fire ranges, and zero otherwise. If colorprob is higher than some threshold value k1, the pixel is a
real fire-colored pixel.

In our project, we choose n equal to five and k1 equal to 0.2. This means that if a pixel is in the
fire color region at least two times in the last five frames, it is a real fire-colored pixel. The
following two examples show the detection of fire-colored pixels: the first images are the real
RGB images, and the second images are the output of the detection of fire-colored pixels, with
the detected pixels shown in blue.
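
A sketch of this part is given below. The video file name is a placeholder, and only the first listed RGB range (R > 220, G > 200, 125 < B) is used as the colorlookup test; the remaining ranges would be added in the same way.

    n = 5;  k1 = 0.2;                          % window length and probability threshold from the text
    v = VideoReader('room.avi');               % placeholder video
    f = readFrame(v);
    lookup = zeros(size(f,1), size(f,2), n);   % circular buffer of the last n colorlookup maps
    idx = 0;
    while hasFrame(v)
        f = readFrame(v);
        R = double(f(:,:,1));  G = double(f(:,:,2));  B = double(f(:,:,3));
        idx = mod(idx, n) + 1;
        lookup(:,:,idx) = (R > 220) & (G > 200) & (B > 125);   % first threshold range only
        colorprob   = mean(lookup, 3);         % average colorlookup value over the last n frames
        fireColored = colorprob > k1;          % at least 2 hits in the last 5 frames
    end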
B - Finding Temporal Variation:

Color is not always enough to detect fire correctly, because in a home environment there can
be many things that have colors similar to fire. To distinguish fire, we use the
temporal variation of the fire-colored pixels. A video camera can take 30 frames per second, which is
enough to observe the characteristic motion of flames. We use temporal variation in conjunction
with fire color to detect fire pixels. The temporal variation for each pixel, denoted by Diffs, is
computed by finding the average of the pixel-by-pixel absolute intensity differences between
consecutive frames. However, this difference may be misleading, because the pixel intensity
may also vary due to global motion in addition to fire flicker. Therefore, we also compute the
pixel-by-pixel intensity difference for non-fire-colored pixels, denoted by nonfireDiffs, and
subtract that quantity from Diffs to remove the effect of global motion.

The temporal variation for each pixel is calculated by averaging the absolute intensity changes over
the last n frames; in our case n is equal to 5, so we calculate the total intensity change over the last
five frames, and the output is a matrix of these values. The average total
intensity difference for non-fire-colored pixels is calculated in the same way, pixel by pixel.
C - Finding Fire Pixels:

Finally, we can detect fire. When a pixel's color is a fire color and its temporal variation is higher
than some threshold value k2, this means that there is a fire. We choose k2 equal to 10.
In the following two examples, you can see the fire detection results. Blue areas have high
temporal intensity variation.
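
The last two parts are sketched together below with placeholder data; Diffs, nonfireDiffs and the k2 threshold follow the description above, while the frame contents and the fire-colour mask are placeholders.

    n = 5;  k2 = 10;                                       % window length and detection threshold
    frames = rand(240, 320, n + 1) * 255;                  % placeholder gray-level frames (0..255)
    fireColored = rand(240, 320) > 0.98;                   % placeholder mask from part A

    absDiff = abs(diff(frames, 1, 3));                     % |I(t+1) - I(t)| for the last n frame pairs
    Diffs   = mean(absDiff, 3);                            % average temporal variation per pixel
    nonfireDiffs = mean(Diffs(~fireColored));              % average variation over non-fire pixels (global motion)
    variation    = Diffs - nonfireDiffs;                   % remove the effect of global motion
    firePixels   = fireColored & (variation > k2);         % part C: fire-coloured and strongly varying
    fireDetected = any(firePixels(:));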
