1. Introduction

Figure 3. Three Arabic words. (a) contains one sub-word with 4 characters, (b) contains three sub-words with 1, 1 and 2 characters respectively, (c) contains two sub-words with 2 and 2 characters.

Due to these major differences between Arabic/Persian and Latin or Chinese scripts, methods proposed for the latter are not suitable for the former. Research on this subject did not start until the early 1980s. IRAC, which was suggested by Amin et al. [3], used a structural classification method. IRACII [4] was based on a segmentation technique. Badi and Shimura used the concept of contour tracing and the identification of the component cursive in their syntactic method [5]. In another method [6], sub-words are identified and separated in the text. Then a histogram is used to segment each sub-word. Mahmoud [7] adopted a combination of Fourier descriptors and contour tracing for Arabic characters. Contour tracing also plays a crucial role in the system proposed by Allam [2]. During the last five years, researchers have suggested several other methods; for a complete survey of the literature on Arabic OCR, the reader is referred to [9]. It must be mentioned that Sakhr Automatic reader no-3.01 and Shonut’s Omnipage Pro Version 2.0 are two examples of commercially available Arabic character recognition systems [17].
In this paper, a multi-font Arabic/Persian character recognition system and its major phases are explained. Within the paper, “Arabic” refers to both Arabic and Persian unless explicitly stated otherwise.

2. The proposed recognition system
The proposed character recognition system for the Arabic character set is composed of the following phases:

2.1. Digitization

2.2. Preprocessing
Preprocessing (pixel level processing) of the input file makes it ready for further processing. This major phase includes the following steps:

2.2.1. Global thresholding. A suitable threshold among potential thresholds is selected as a global threshold by employing Otsu’s method [15]. Pixels with a value greater than the global threshold are treated as background and the others as foreground pixels.

2.2.2. Connected components recognition. Connected components (cc) are rectangular boxes that bound regions of connected foreground pixels. The objective of this step is to form these rectangles around distinct components of the input file [16].

2.2.3. Grouping. The next step is the grouping of neighboring connected components of similar dimension. The algorithm takes one cc at a time and tries to merge it into a group from a set of existing groups. If it succeeds, the group’s dimensions are altered to accommodate the new cc. If the cc cannot be merged with any of the existing groups, a new group is formed with the cc as its sole member. Figure 4 shows an Arabic word, its connected components and its group.

Figure 4. (a) An Arabic word, (b) its connected components (sub-words), (c) its group.

2.2.4. Skew detection and correction. The adopted skew detection algorithm [11] attempts to determine the skew angle of the entire document by calculating a skew angle for each group. Then a skew correction algorithm is applied to the input file.
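The global thresholding step (2.2.1) can be illustrated with a minimal sketch of Otsu’s method, assuming an 8-bit grayscale image stored as a list of rows. The function names `otsu_threshold` and `binarize` are hypothetical and do not come from the described system; only the polarity (values greater than the threshold are background) follows the text.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of the gray levels of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break             # upper class is empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Mark foreground as 1 (value <= threshold) and background as 0."""
    return [[0 if p > threshold else 1 for p in row] for row in image]
```

For example, `binarize(image, otsu_threshold([p for row in image for p in row]))` yields a binary image in which dark ink pixels are 1, which is the input the connected components step expects.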
2.3. Feature extraction
The proposed work adopts a global approach to character recognition, and no character segmentation is required.

2.3.1. Contour tracing. The basic step for determining sub-words within each word is tracing the outer contours of all its elements. Within the boundaries of each group (representing a word), a raster scan from top to bottom and left to right is started until the first foreground pixel is reached. From this point, contour tracing begins, adopting the Freeman chain code and the Left-Most-Looking (LML) rule [7]. Two different types of contours are produced in this step: external and loop. An Arabic word and its contours are shown in Figure 5.

2.3.2. Contour analysis. In this step, first all the word’s sub-words are determined and then each sub-word’s contours are analyzed and classified. Each sub-word can have three types of external contours: main body, complementary character or noise. Noise of noticeable size is another sort of contour that can be found in the image file. During the digitization phase, some spurious pixels may appear in the image file. Fortunately, they are not large and can be recognized by their contour’s length.

[...] pixel of the contour, necessary information about each pixel of the contour, the contour’s length and class.

2.3.3. Sub-words detection. Another analysis of the set of contours is performed to find a main body, which determines a sub-word. To find the vertical boundaries of this contour, the two pixels with the largest and smallest Y-coordinate values must be identified. This is done by traversing the linked list of information about all pixels of the contour and comparing their Y-coordinates with the current minimum and maximum.

2.3.4. Detecting a sub-word’s complementary characters. All the contours of complementary characters that belong to the detected sub-word should be found. Therefore, another search is performed through the array to find all the external contours whose starting points have Y-coordinates that fall within the two boundaries. As mentioned before, a dot can take the form of 1, 2 or 3 dots. In some fonts, dots can be attached together and form a bigger island.
For detecting non-attached dots, the distances between the X and Y coordinates of the starting points of their contours are checked. If these distances are less than two certain thresholds, the contours belong to non-attached dots and the number of contours determines the type of dots (2 or 3). To detect the type of attached dots (see Figure 6), three different methods have been designed and tested on all available Arabic fonts for Windows. These methods are based on comparing the length and area of the attached dots’ contour with different defined thresholds, and on comparing the chain-code sequence of the contour with certain patterns. The results show that the third method works best among the three. In fact, the third method enables the proposed system to work as an omni-font OCR for all the fonts available in the Arabic for Windows word processor.
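The boundary search of 2.3.3 and the non-attached dot check of 2.3.4 can be sketched as follows. This is only an illustration under stated assumptions: the `Contour` record, the function names and the threshold parameters `max_dx` and `max_dy` are hypothetical, a plain Python list stands in for the linked list of contour pixels, and the paper does not give the actual threshold values.

```python
from dataclasses import dataclass


@dataclass
class Contour:
    start: tuple   # (x, y) of the contour's starting point
    pixels: list   # [(x, y), ...] for every traced contour pixel


def vertical_bounds(main_body):
    """Scan the contour once, tracking the running minimum and maximum Y."""
    y_min = y_max = main_body.pixels[0][1]
    for _, y in main_body.pixels:
        if y < y_min:
            y_min = y
        elif y > y_max:
            y_max = y
    return y_min, y_max


def complementary_contours(main_body, externals):
    """Keep external contours whose starting Y lies within the two bounds."""
    y_min, y_max = vertical_bounds(main_body)
    return [c for c in externals
            if c is not main_body and y_min <= c.start[1] <= y_max]


def dots_near(seed, candidates, max_dx, max_dy):
    """Collect contours whose starting points are close to the seed's.

    max_dx and max_dy are assumed thresholds; the size of the returned
    group gives the dot type (2 or 3) for non-attached dots.
    """
    group = [seed]
    for c in candidates:
        if c is seed:
            continue
        dx = abs(c.start[0] - seed.start[0])
        dy = abs(c.start[1] - seed.start[1])
        if dx < max_dx and dy < max_dy:
            group.append(c)
    return group
```

Attached dots are not covered by this sketch, since the text only outlines the three tested methods (length, area and chain-code pattern comparisons) without giving their thresholds or patterns.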