
2011 3rd International Conference on Machine Learning and Computing (ICMLC 2011)

Recognition of Handwritten Malayalam Characters using Vertical & Horizontal
Line Positional Analyzer Algorithm
Abdul Rahiman M, M S Rajasree
Research Scholar, Karpagam University, Coimbatore &
Asst Professor, Computer Science, LBSITW Trivandrum.
Professor & Head, Dept of CSE, CET, Kerala.

Masha N, Rema M, Meenakshi R, Manoj Kumar G
Department of Computer Science & Engg
LBS Institute of Technology for Women,
Poojappura, Trivandrum, Kerala

Abstract: This paper proposes an algorithm for the
recognition of handwritten characters in Malayalam, a South
Indian language. It introduces the salient features of the
Malayalam script and lists the approaches used for character
recognition. Malayalam script is rich in patterns because of
its complex curved forms, large number of basic elements
and the presence of conjuncts. The combinations of such
patterns make the recognition of characters much more complex,
and these patterns should be exploited to arrive at a solution.
An image of handwritten Malayalam characters is given
as the input, and an editable document of Malayalam
characters in a predefined format is produced as output.
The overall structure of the OCR system is presented first.
Then the OCR process is presented in three
modules: Pre-processing, Skeletonization and Recognition. In
Pre-processing, we scan the input image and separate each
character from it. In Skeletonization, we obtain a one-pixel-thick
skeleton of the character. In Recognition, we classify the
characters based on their features. The features of the
characters are extracted by analysing the position and
count of the horizontal and vertical lines.

Keywords: Malayalam, Handwritten characters, Feature
Extraction, Optical Character Recognition.

I. INTRODUCTION

Optical character recognition, usually abbreviated to
OCR, is the mechanical or electronic translation of scanned
images of handwritten, typewritten or printed text into
machine-encoded text. There are a number of reasons for
choosing OCR scanning over other methods of data entry.
Some of the more significant reasons include:

To reduce Data Entry Errors
To Consolidate Data Entry
To Handle Peak Loads
Human Readable
Can Be Used with Many Printing Techniques
Scanning Corrections.

978-1-4244-9253-4/11/$26.00 ©2011 IEEE

An automatic character recognition system is one of the
most fascinating and challenging areas of pattern recognition,
with a wide range of practical applications such as mail sorting,
forms processing, preserving historical documents in editable
format, desktop publishing, backup of rare books,
reading aids for the blind, and other applications involving
language processing, word indexing and library automation.
Reading handwritten text is a very difficult task
considering the diversity that exists in ordinary penmanship.
Currently many OCR systems are available for handling
handwritten English documents by defining
specifically sized character boxes and reading constrained
handwritten entries. Such systems are also available for
many European and Asian languages, such as Japanese and
Chinese [1] [2], with a reasonable level of accuracy. Though
many efforts at developing OCR systems for Indian
languages have been reported, an efficient system is yet to be
proposed.
Almost all Indian languages are two-dimensional
compositions, and this property is especially pronounced in South
Indian languages such as Malayalam, Kannada, Tamil and
Telugu. They consist of core characters and modifiers. These
scripts are formed by curves, holes and loops, unlike English,
where the predominant feature [3] is strokes. Also, the
concept of lowercase and uppercase characters is not present.
There are many more complex symbols than in English, as there is
no fixed set of characters in the case of South
Indian languages. They have more than 50 characters in their
character set, including around 14 vowels, the rest being
consonants. We can call these vowels and consonants the basic
characters. Estimates of the number of characters present in
each language differ. In addition
to this basic character set, there are compound characters,
which are formed by combining more than one basic
character. The shape of a compound character is very complex.
As a result of these compound characters, the number of
symbols in the language increases, and this leads to a
complex recognition process. This is one of the reasons for
the absence of an efficient OCR system for these languages.

II. MALAYALAM CHARACTER SET

Malayalam is a South Indian language and the
principal language of the State of Kerala, spoken by about
36 million people in the world. The Malayalam script is
a Brahmic script used commonly to write the Malayalam
language. Like many other Indic scripts, Malayalam follows
a writing system that is partially alphabetic and partially
syllable-based. Malayalam is written in both an old and a new
script. The old script concatenates
various characters, whereas the new script separates the
characters with a special character. The Malayalam alphabet
is unicase, i.e. it does not have a case distinction. It is
written from left to right, but certain vowel signs are
attached to the left (the opposite direction) of the consonant
letter that they logically follow.
The Malayalam language was first written in Vatteluttu,
an ancient script for Tamil. However, the modern Malayalam
script evolved from Grantha, a script originally used to
write Sanskrit. Both Vatteluttu and Grantha evolved
from Brahmi, but independently. The challenges
posed by the Malayalam script are its large character set of
more than 900 characters, the similarity of character
shapes, and the complexity of character structure. As a
consequence of the disparity and irregularity in the
dimensions of characters, an algorithm is needed that is totally
independent of size yet concentrates on the characteristic
depiction. The modern Malayalam script contains
13 vowel letters, 36 consonant letters, and a few other
symbols.
Vowels are known as Svaram or Svaraksharangal in
Malayalam. There are two main types of vowels:
independent vowels and dependent vowels. An independent
vowel letter is used as the first letter of a word that begins
with a vowel. The vowel signs ā, i, ī are placed to the right
of the consonant letter to which they are attached. The vowel
signs e, ē, ai are placed to the left of a consonant letter. The
vowel signs o and ō consist of two parts: the first part goes
to the left of a consonant letter and the second part goes to
the right of it. Figure 1 shows the vowels used in the
modern Malayalam script:

Figure 1: Vowels used in Modern Malayalam Script.

Consonants in Malayalam are known as Vyanjanam or
Vyanjanaksharangal. A consonant letter, despite its name,
does not represent a pure consonant, but represents a
consonant + a short vowel /a/ by default. For example, ക is
the first consonant letter of the Malayalam alphabet, which
represents /ka/, not a simple /k/. Figure 2 shows the
consonants used in the modern Malayalam script.
A vowel sign is a diacritic attached to a consonant letter
to indicate that the consonant is followed by a vowel other
than /a/. If the following vowel is /a/, no vowel sign is
needed. The phoneme /a/ that follows a consonant by
default is called an inherent vowel. In Malayalam, its
phonetic value is unrounded. To denote a pure consonant
sound not followed by a vowel, a special diacritic, the virama, is
used to cancel the inherent vowel. Figure 3 shows
the various vowel diacritics of one particular consonant, /ka/:

Figure 2: Consonants in Malayalam Character Set.


Figure 3: Various vowel diacritics of one particular Malayalam consonant /ka/.

Another type of character in the Malayalam script
is the conjunct consonant. These characters are
formed by combining more than one consonant
and were widely used in the old script. Conjunct consonants are
important in the Malayalam script as they help to convey more
meaning in the form of new characters. Though some of
these characters can be written separately in the new script,
conjunct consonants cannot be fully avoided.
Figure 4 shows some of the conjunct consonants.

Figure 4: Various conjunct consonants in Malayalam.

III. EXISTING METHODS FOR MALAYALAM OCR

Due to the complexity of the Malayalam character set, an
efficient method for the recognition of handwritten
characters has not been proposed so far. An OCR system based on
Otsu's algorithm for binarization was devised by the
Centre for Development of Advanced Computing [4]
(CDAC), Thiruvananthapuram, Kerala, a Government of
India institution. In this system, the projection profile method is
used for skew detection and correction of the image, and
linguistic rules are applied in the recognition phase. An accuracy
of 97% was reported for this method. A method using wavelet-based
feature extraction and neural-network-based recognition
was reported by M Abdul Rahiman and Rajasree
[5]. Another work was reported by G Raju [6], in which
Daubechies wavelets (db4) were used for recognition.
Another OCR system, based on statistical classification, was
proposed by Lajish V L, Suneesh
T K and Narayanan N K [7] [8]. Most recently, a method for the
recognition of isolated handwritten Malayalam characters
using HLH intensity patterns was devised by M Abdul
Rahiman, G Manoj Kumar and M S Rajasree [9].

IV. OVERVIEW OF THE PROPOSED TECHNIQUE

The architecture of the method is depicted in figure 5. The
method focuses on the identification of handwritten
Malayalam characters. The main advantage of this method
is that characters of different colours
written in the same document can be identified. Characters can also
be identified even if they are written on a coloured
background. The method can also identify background noise up
to a certain extent.

Figure 5: Architecture of the technique.

The character identification is done in three phases:
Pre-processing
Skeletonization
Recognition

Figure 6 shows the input image on different backgrounds
and the gray scale image obtained after preprocessing.

A. Preprocessing Technique
This phase covers the steps carried out before the actual
identification. The colours of the
background and of the written characters are checked.
If they are of different colours, methods are applied to
produce uniformly coloured characters. Background noise
up to a certain intensity can be identified and removed. In
order to identify the characters, the text is first segmented
into individual characters. The scanned
text is first subjected to a line separation process, in which the
written document is separated into lines of characters. After
line separation, each line is subjected to the
character separation process, in which the characters in the line
are separated into individual units. This simplifies the
processing in the following phases.
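
A minimal Python sketch of this kind of colour normalisation is given here. The paper does not name the method it applies, so the luminance weights, the midpoint threshold and the names `to_gray` and `binarize` are assumptions made for illustration. The sketch also assumes that the ink is darker than the background.

```python
def to_gray(pixel):
    """Convert an (R, G, B) pixel to a 0-255 gray level (BT.601 luminance weights)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def binarize(rgb_image):
    """Return a 0/1 image in which 1 marks ink (assumed darker than the background)."""
    gray = [[to_gray(p) for p in row] for row in rgb_image]
    levels = [g for row in gray for g in row]
    # Assumption: threshold halfway between the darkest and lightest gray level.
    threshold = (min(levels) + max(levels)) / 2
    return [[1 if g < threshold else 0 for g in row] for row in gray]


# Blue ink and red ink on a yellow background (hypothetical sample page).
Y, B, R = (255, 255, 0), (0, 0, 200), (200, 0, 0)
page = [
    [Y, Y, Y, Y, Y],
    [Y, B, Y, R, Y],
    [Y, B, Y, R, Y],
    [Y, Y, Y, Y, Y],
]
binary = binarize(page)
for row in binary:
    print(row)
```

Both ink colours map to 1 and the background maps to 0, so the later phases can work on a uniform black-and-white image regardless of the original colours.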

Figure 6: Input Image and the Gray Scale of the Input Image.

Algorithm 1: Line Separation
Step 1: Initially, the width and height of the input image
are calculated and stored.
Step 2: The image is scanned from the top left pixel till the
picture width by varying the x coordinate.
Step 3: If any black pixel is found, the x and y positions
are stored. A line is drawn to indicate the start of a
line.
Step 4: The scanning is repeated by incrementing the y
value each time, to find the height of the line.
Step 5: The process is repeated till none of the pixels in the
line are black. A line is drawn to mark the end of
the line.
Step 6: The above three steps are repeated till the picture
height. The coordinates at the start and end of each
line are stored for character separation.

Algorithm 2: Character Separation
Step 1: The scanning starts at the topmost pixel position of
the line of the document.
Step 2: The pixel positions are scanned from top to
bottom till the end of the line by varying the y coordinate.
Step 3: If any black pixel is found, the x and y positions
are stored.
Step 4: The scanning is repeated by incrementing the x
value each time to find the width of the character.
Step 5: The process is repeated till none of the pixels in the
line are black. A line is drawn to mark the
character.
Step 6: The above three steps are repeated for each line till
all the characters are separated and their positions
are stored.
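
The two separation algorithms above can be sketched in Python as follows. This is an illustrative simplification and not the authors' implementation: instead of drawing marker lines, it treats every row (or column) without black pixels as a gap and returns the start and end coordinates of each run. The names `find_runs`, `separate_lines` and `separate_characters` are hypothetical, and the image is assumed to be a list of rows holding 1 for black and 0 for white.

```python
def find_runs(flags):
    """Return (start, end) pairs, end exclusive, for each run of truthy flags."""
    runs, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                  # first inked row/column of a run
        elif not flag and start is not None:
            runs.append((start, i))    # first blank row/column ends the run
            start = None
    if start is not None:
        runs.append((start, len(flags)))
    return runs


def separate_lines(image):
    """Group rows that contain at least one black pixel into text lines."""
    return find_runs([any(row) for row in image])


def separate_characters(image, top, bottom):
    """Group inked columns of the rows top..bottom-1 into characters."""
    width = len(image[0])
    inked = [any(image[y][x] for y in range(top, bottom)) for x in range(width)]
    return find_runs(inked)


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
lines = separate_lines(page)
boxes = [(top, bottom, left, right)
         for top, bottom in lines
         for left, right in separate_characters(page, top, bottom)]
print(lines)
print(boxes)
```

A gap-based split of this kind only separates characters that do not touch or overlap horizontally, which is consistent with the paper's focus on isolated basic characters.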


Figure 7: Input Image

B. Skeletonization
Skeletonization algorithms address the need to compute a
reduced amount of data or to simplify the shape of an object
in order to find features for recognition and
classification algorithms. Skeletonization is the transformation of a
component of a digital image into a subset of the original component. There
are different categories of skeletonization methods. One
category is based on distance transforms, where a specified
subset of the transformed image is a distance skeleton. The
original component can be reconstructed from the distance
skeleton.
Another category is defined by thinning approaches. The
result of skeletonization using thinning algorithms should be
a connected set of digital curves or arcs. The segmented
characters are subjected to the thinning algorithm.
Thinning is the process of peeling off as many
pixels of a pattern as possible without affecting its general
shape. The skeleton obtained must be as thin as possible,
connected and centered. Individual pixels are removed
either in sequential order or in parallel. Normally, thinning is
implemented as an iterative process that transforms
specified contour points into background points.
Algorithm 3: Thinning
Step 1: Define 2 functions for a pixel P1, A(P1) and B(P1)
as follows:
i) A(P1) = number of 0, 1 patterns (transitions from 0 to 1)
in the ordered sequence of P2, P3, P4, P5, P6,
P7, P8, P9, P2.
ii) B(P1) = P2 + P3 + P4 + P5 + P6 + P7 + P8 + P9
(number of black or 1 pixel, neighbors of P1).

Step 2: If it satisfies the following four conditions, change
a pixel from black to white:
i) 2 <= B(P1) <= 6,
ii) A(P1) = 1,
iii) P2 * P4 * P8 = 0 or A(P2) = 1,
iv) P2 * P4 * P6 = 0 or A(P4) = 1
Step 3: Repeat Step 2 multiple times on the pattern,
checking all the pixels each time, until
no more black pixels can be removed.

Figure 8: Line Separation

Figure 9: Character Separation.
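
As an illustration of the thinning phase, the Python sketch below implements the classic two-pass Zhang-Suen thinning procedure. It uses the same functions A(P1) and B(P1) as Algorithm 3, but its deletion conditions are the standard two-sub-iteration ones and therefore differ from conditions iii) and iv) as listed in the paper; it is a swapped-in, well-known variant and not the authors' exact rule. The names `neighbours`, `transitions` and `thin` are hypothetical, and the image is assumed to be a list of 0/1 rows whose border pixels are all 0.

```python
def neighbours(img, r, c):
    """P2..P9: the eight neighbours of (r, c), clockwise from the pixel above."""
    return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
            img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]


def transitions(p):
    """A(P1): number of 0 -> 1 steps in the cyclic sequence P2, P3, ..., P9, P2."""
    return sum(1 for a, b in zip(p, p[1:] + p[:1]) if a == 0 and b == 1)


def thin(image):
    """Zhang-Suen thinning; returns a new image and leaves the input untouched."""
    img = [row[:] for row in image]
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r in range(1, len(img) - 1):
                for c in range(1, len(img[0]) - 1):
                    if img[r][c] != 1:
                        continue
                    p = neighbours(img, r, c)
                    p2, p4, p6, p8 = p[0], p[2], p[4], p[6]
                    # sum(p) is B(P1), the number of black neighbours.
                    if not (2 <= sum(p) <= 6 and transitions(p) == 1):
                        continue
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if ok:
                        to_clear.append((r, c))
            # Pixels are cleared together after the scan (parallel deletion).
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# A 3-pixel-thick bar and an already thin stroke, each with a blank border.
block = [[0] * 7] + [[0, 1, 1, 1, 1, 1, 0] for _ in range(3)] + [[0] * 7]
stroke = [[0] * 7, [0, 1, 1, 1, 1, 1, 0], [0] * 7]
skeleton = thin(block)
for row in skeleton:
    print(row)
```

The thick bar loses pixels while a stroke that is already one pixel thick is returned unchanged, which is the behaviour the recognition phase relies on.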

C. Recognition
The final phase of the identification of characters
involves a series of methods. The skeletonised and
segmented characters are passed to functions that
count the horizontal and vertical lines, which
form the features of the characters. Using the count of
horizontal and vertical lines, the characters are classified
into different groups. For example, consider the character
Ra. It has two vertical lines and a single horizontal line, so
it can be classified into the group of characters having similar
features. For the recognition of certain characters, the count of
horizontal and vertical lines is enough. But for other
characters, such as La, Va and Pa, the positions of these
lines are also important, as they differentiate the characters
from each other. Hence the positions of these lines are also
calculated, i.e. whether they are at the top, bottom, left or right.
After calculating
the count and position of the horizontal and vertical lines, the
characters are classified into different groups.
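
The two-stage grouping described above (line counts first, line positions only when the counts tie) can be sketched in Python as follows. The feature table is hypothetical: only the counts for Ra (two vertical lines, one horizontal line) come from the text, while `glyph_A`, `glyph_B` and every position value are invented placeholders. The function names are likewise assumptions.

```python
from collections import defaultdict


def group_by_counts(features):
    """Map (vertical_count, horizontal_count) to the labels sharing those counts."""
    groups = defaultdict(list)
    for label, (vertical, horizontal, _positions) in features.items():
        groups[(vertical, horizontal)].append(label)
    return dict(groups)


def candidates(features, vertical, horizontal, positions):
    """Labels whose counts match, narrowed by line positions when counts tie."""
    same_counts = group_by_counts(features).get((vertical, horizontal), [])
    if len(same_counts) <= 1:
        return same_counts
    return [label for label in same_counts if features[label][2] == positions]


# Hypothetical table: label -> (vertical lines, horizontal lines, positions).
# Only Ra's counts are taken from the paper; everything else is a placeholder.
features = {
    "Ra": (2, 1, ("top",)),
    "glyph_A": (2, 1, ("bottom",)),
    "glyph_B": (1, 0, ()),
}
groups = group_by_counts(features)
print(groups)
print(candidates(features, 2, 1, ("bottom",)))
```

Keeping the counts as the first key means the position comparison is only needed inside groups with more than one member, which mirrors the order in which the text applies the two features.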
Algorithm 4: To count the number of vertical lines in a
character
Step 1: Initially, the height and width of the character are
calculated.
Step 2: The character box is scanned horizontally from the
top left position to find the leftmost black pixel.
Step 3: When a black pixel is found, the count is incremented
and the pixels at the top, bottom
and diagonal to it are also checked.
Step 4: If any of those pixels are black, the count is again
incremented. This is done to check for slanted or
curved vertical lines.
Step 5: Else if none of these pixels are black, check
whether the count has reached the character height.
If so, a vertical line has been found and the vertical
line count is incremented.
Step 6: Repeat the above steps by incrementing the x
coordinate till the width of the character.
Step 7: The above process is repeated for each character in
the text.

Figure 10: Character recognition process by calculating
number of horizontal and vertical lines

Algorithm 5: To calculate the number and position of
horizontal lines in a character
Step 1: Initially, the height and width of the character are
calculated.
Step 2: The character box is scanned vertically from the
top left position to find the leftmost black pixel.
Step 3: When a black pixel is found, the count is incremented
and the pixels at the left, right
and diagonal to it are also checked.
Step 4: If any of those pixels are black, the count is again
incremented. This is done to check for slanted or
curved horizontal lines.
Step 5: Else if none of these pixels are black, check
whether the count has reached around the
character width. If so, a horizontal line has been
found and the horizontal line count is incremented.
Step 6: Divide the character box into three vertical
segments in order to calculate the position of the
horizontal lines.
Step 7: If the left end point of the horizontal line lies in the left
vertical segment and the right end point in the right
vertical segment, the position of the line is in the
middle.
Step 8: Else if the left end point of the horizontal line lies
in the left vertical segment, the position of the line is in the
left.
Step 9: Else if the right end point of the horizontal line lies
in the right vertical segment, the position of the line is in
the right.
Step 10: Else, the position of the line is in the middle.
Step 11: Check whether the horizontal line is at the top,
bottom or centre of the character box.
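
A simplified Python sketch of the vertical and horizontal line analysis is given below. It is an interpretation and not the authors' code: strokes are traced row by row through the pixel below or diagonally below, horizontal lines are found by running the same tracer on the transposed image, and the `min_fraction` cut-off, the use of thirds for top/centre/bottom and all function names are assumptions. The 5x5 `glyph` is a made-up test shape with two vertical strokes and one top bar.

```python
def trace_strokes(img, min_fraction=0.8):
    """Find strokes that run from top to bottom in a 0/1 skeleton image.

    Each stroke is followed downwards one row at a time, moving to the pixel
    directly below or diagonally below, so slanted or curved strokes are kept.
    Returns ((row, col), (row, col)) start/end pairs for strokes covering at
    least min_fraction of the image height.
    """
    height, width = len(img), len(img[0])
    used = [[False] * width for _ in range(height)]
    strokes = []
    for c in range(width):
        for r in range(height):
            if img[r][c] != 1 or used[r][c]:
                continue
            path, rr, cc = [(r, c)], r, c
            while rr + 1 < height:
                below = [x for x in (cc, cc - 1, cc + 1)
                         if 0 <= x < width and img[rr + 1][x] == 1
                         and not used[rr + 1][x]]
                if not below:
                    break
                rr, cc = rr + 1, below[0]
                path.append((rr, cc))
            if len(path) >= min_fraction * height:
                for pr, pc in path:
                    used[pr][pc] = True   # a pixel belongs to one stroke only
                strokes.append((path[0], path[-1]))
    return strokes


def transpose(img):
    """Swap rows and columns so horizontal strokes become vertical ones."""
    return [list(col) for col in zip(*img)]


def horizontal_position(x_left, x_right, y, width, height):
    """Steps 6 to 11 of Algorithm 5: (side, level) of one horizontal line."""
    in_left = x_left < width / 3
    in_right = x_right >= 2 * width / 3
    if in_left and in_right:
        side = "middle"
    elif in_left:
        side = "left"
    elif in_right:
        side = "right"
    else:
        side = "middle"
    if y < height / 3:
        level = "top"
    elif y >= 2 * height / 3:
        level = "bottom"
    else:
        level = "centre"
    return side, level


glyph = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]
vertical = trace_strokes(glyph)
horizontal = trace_strokes(transpose(glyph))
# In the transposed image a point reads as (x, y) of the original image.
(x_left, y_left), (x_right, _) = horizontal[0]
position = horizontal_position(x_left, x_right, y_left, len(glyph[0]), len(glyph))
print(len(vertical), len(horizontal), position)
```

Reusing one tracer on the transposed image keeps the vertical and horizontal rules symmetric; a lower `min_fraction` would be needed to detect horizontal lines that span only the left or right part of the character box.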

Figure 11: Recognised Characters

Figures 12 and 13 show the decision trees
formulated from the vertical and horizontal line positional
analyzer algorithm. The following acronyms are used in the
decision trees:
VL = Vertical Lines, HL = Horizontal Lines, M = Middle
Line, T = Top Line, B = Bottom Line.

V. CONCLUSION

In this paper, we have proposed an algorithm to recognize
handwritten Malayalam characters in a scanned text and
produce an editable document. So far we have tried to
accurately identify the basic, simple handwritten characters
written in different colours and on different coloured
backgrounds. Pre-processing is done to improve the accuracy
of character recognition and to remove background
noise. The major factor hindering the improvement of
accuracy is the similarity in shape and features of
certain Malayalam characters. Further refinement of the
system is possible by training the OCR engine to handle
commonly encountered errors. Further investigation is
underway in order to recognize the old script of Malayalam.
The future scope of this work lies in the recognition of
connected characters used in the script. An accuracy of 91%
is achieved in this work.

Figure 12: Decision tree based on classification of number of
vertical lines

Figure 13: Decision tree based on classification of characters by
horizontal lines where number of vertical lines equals 2.

REFERENCES

[1] S N Srihari, X Yang and G R Ball, Offline Chinese Handwriting
Recognition: an assessment of current Technology, Front. Computer
Science, China, Vol. 1 (2), pp 137-155, 2007.
[2] Mohamed Cheriet, Nawwaf Kharma, Cheng-Lin Liu and Ching Y
Suen, Character Recognition Systems, 2007.
[3] D. Trier, A K Jain and T Taxt, Feature Extraction methods for
Character Recognition: A Survey, Pattern Recognition, Vol 29, pp
641-662, 1996.
[4] Journal of Language Technology, Viswabharat@tdil, July 2003.
[5] M Abdul Rahiman and M S Rajasree, Printed Malayalam Character
Recognition Using Back propagation Neural Networks, Proc. of
IEEE International Advance Computing Conference (IACC 2009),
Patiala, March 2009.
[6] G Raju, Recognition of unconstrained handwritten Malayalam
characters using zero crossings of wavelet coefficients, Proc. of
International Conference on Advanced Computing and
Communications, ADCOM, pp 217-221, Dec 2006.
[7] Lajish V L, Suneesh T K K and Narayanan N K, Recognition of
Isolated handwritten images using Kolmogorov-Smirnov Statistical
classifier and K nearest neighbor classifier, Proc. of International
Conference on Cognition and Recognition, Mandya, Karnataka,
December 2005.
[8] Lajish V L, Handwritten Character Recognition using perceptual
Fuzzy zoning and Class modular Neural Networks, Proc. of fourth
International Conf on Innovations in IT, 2007.
[9] M Abdul Rahiman and M S Rajasree, Isolated Handwritten
Malayalam Character Recognition using HLH Intensity Patterns,
Second International Conference on Machine Learning and
Computing, Bangalore, 2010.
