In a country like India, where many different scripts are in use, automatic identification of handwritten script supports important applications such as automatic transcription of multilingual documents and selection of a script-specific OCR in a multilingual environment. Existing script identification techniques depend on various features extracted from document images at the character, word, text line or block level. All of these tasks fall under the general heading of document analysis, which has been a fast-growing area of research in recent years. We propose a novel method for multi-script identification at the block level. We describe a system that automatically identifies the script used in documents stored electronically in image form. The system can learn to distinguish any number of scripts. It develops a set of representative symbols (templates) for each script by clustering textual symbols from a set of training documents. To identify a new document's script, the system compares a subset of symbols from the document to each script's templates, screens out rare or unreliable templates, and chooses the script whose templates provide the best match. The increasing use of handheld devices that accept handwritten input has created a growing demand for algorithms that can efficiently analyze and retrieve handwritten data.
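The template idea summarized above can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the source does not specify the symbol features, the clustering method or the match score, so the fixed-length feature vectors, the greedy clustering with a `radius` threshold, the `min_count` screening rule and the function names `build_templates` and `identify_script` are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def build_templates(symbols, radius, min_count):
    """Cluster one script's training symbols into templates (centroids).

    Assumed scheme: greedy clustering, where each symbol joins the nearest
    existing cluster within `radius`, otherwise it starts a new cluster.
    """
    clusters = []  # each entry is [coordinate_sums, member_count]
    for s in symbols:
        best, best_d = None, radius
        for c in clusters:
            centroid = [v / c[1] for v in c[0]]
            d = dist(s, centroid)
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            clusters.append([list(s), 1])
        else:
            best[0] = [a + b for a, b in zip(best[0], s)]
            best[1] += 1
    # Screen out rare templates: keep clusters with enough members.
    return [[v / n for v in total] for total, n in clusters if n >= min_count]


def identify_script(symbols, templates_by_script):
    """Pick the script whose templates match the document's symbols best.

    Assumed score: mean distance from each symbol to its nearest template.
    """
    def score(templates):
        if not templates:
            return float("inf")  # a script with no templates cannot match
        total = sum(min(dist(s, t) for t in templates) for s in symbols)
        return total / len(symbols)

    return min(templates_by_script, key=lambda name: score(templates_by_script[name]))


# Toy 2-D feature vectors standing in for symbol images (hypothetical data).
training = {
    "script_a": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)],
    "script_b": [(1.0, 1.0), (1.1, 1.0), (1.0, 0.9)],
}
templates = {
    name: build_templates(syms, radius=0.5, min_count=2)
    for name, syms in training.items()
}
print(identify_script([(0.05, 0.05), (0.1, 0.1)], templates))  # script_a
```

In the toy data, the outlying symbol `(5.0, 5.0)` forms a single-member cluster and is dropped by the `min_count` screen, which mirrors the step of discarding rare or unreliable templates.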
TABLE OF CONTENTS
CHAPTER NO.    TITLE
1.    INTRODUCTION
      1.1    History of Script Identification
      1.2    Script Identification
             1.2.1    Character, word or line analysis
             1.2.2    Text block analysis
             1.2.3    Hybrid analysis
      1.3
      1.4    System Requirements
             1.4.1    Hardware Requirements
             1.4.2    Software Requirements
             1.4.3    Software Description
2.    LITERATURE REVIEW
3.    SYSTEM DESIGN
      3.1    Pre-Processing
      3.2    Feature Extraction
      3.3    Neural Network
4.    IMPLEMENTATION
      4.1
      4.2    Feature Extraction
      4.3    Neural Network Training
             4.3.1    Using the Neural Network Fitting Tool GUI
5.    SYSTEM TESTING
      5.1    Testing the whole system
      5.2    Black box Testing and White box Testing
6.
      6.1
      6.2
APPENDIX
      Snapshots
APPENDIX
      Bibliography
      Reference
LIST OF FIGURES
FIGURE NO.
3.1 3.2 3.3 3.4 3.5 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 4.10 4.11 4.12 4.13 4.14 A.1 A.2 A.3 A.4 A.5 A.6
FIGURE NAME
Overall Architecture of Script Identification
Classification Process of Identification
Pre-Processing
Image Retrieval Process
Neural Network Training
Performance Graph of Identification
Regression Graph of Identification
Neural Network Fitting Tool Starting Page
Neural Network Fitting Tool
Neural Network Fitting Data Set Chooser
Validation and Test Data
Network Size Assumption
Train Network
Neural Network Training
Regression Graph
Regression Graph (Plot Regression)
Evaluate Network
Save the Result
Identifying Single Number
Identifying Block of Numbers
Identifying Block of Numbers
Command Prompt for Displaying Numbers
Input for the Alphabets Combined with Number
LIST OF TABLES
Table No.    Table Name
1.    Summarization of the Methods on Script Identification
2.    Difference between Black-box and White-box Testing
LIST OF KEYWORDS
1.    OCR       Optical Character Recognition
2.    SOFM      Self Organizing Feature Maps
3.    LVQ       Learning Vector Quantization
4.    MATLAB    Matrix Laboratory
5.    ECM       Enterprise Content Management
6.    ANN       Artificial Neural Network
7.    FRS       Functional Requirement Specification
8.    SRS       System Requirement Specification