
14 Character and Handwriting Recognition

In contrast to the field of automatic speech recognition, where Markov model-based methods currently represent the state of the art, HMMs and n-gram models are still a rather new approach for the recognition of machine-printed or handwritten texts. This might be due to the fact that data obtained from writing or print can in general be segmented into manageable units, such as words or characters, much more easily than spoken language. Therefore, especially in the field of OCR, but also for the processing of forms or the reading of address fields, a number of well-established methods exist that rely on the classic distinction between segmentation and classification. Segmentation-free methods on the basis of Markov models, however, are mainly applied by researchers who previously gathered experience with this technology in the field of automatic speech recognition.

In the following we will first present the OCR system of BBN, which explicitly uses the BYBLOS system originally developed for automatic speech recognition. It shows the principal approach for offline processing, even though the recognition of machine-printed characters is obviously easier than the recognition of handwriting. Afterwards, we will present the online handwriting recognition system developed by Rigoll and colleagues at the University of Duisburg, Germany, which applies a number of well-known Markov model-based techniques. The chapter concludes with a presentation of our own system for offline handwriting recognition, which, just like the speech recognition systems presented in section 13.3, was developed using the ESMERALDA toolkit.

14.1 OCR System by BBN


The OCR system developed at BBN is a Markov model-based recognition system for machine-printed text, which is capable of handling very large vocabularies and can also work with an unlimited vocabulary. As already mentioned above, it is based on the speech recognition system BYBLOS, which was presented in section 13.2, page 209. In order to apply this system to OCR problems, some modifications are required, which are documented in [13] and which will be presented briefly in the following.

Feature Extraction

The basis of offline character recognition is the optical capturing of a complete document page with a resolution of 300 to 600 dpi (footnote 1). In order to convert such a document image into a chronologically organized signal, the individual text lines are segmented first. Compensating for the rotation of the page image ensures that the lines are oriented horizontally. They can then be identified by simply searching for the minima in a horizontal projection histogram of the gray level image. Every text line is then subdivided into a sequence of small vertical stripes, which overlap each other by two thirds. The width of the stripes is 1/15 of the line height, which was normalized in order to compensate for different font sizes.

For feature extraction the individual text stripes are again subdivided into 20 overlapping cells, which correspond to small rectangular image regions. Feature vectors are then generated on the basis of these cells. Per cell, the average intensity and its horizontal and vertical derivatives are computed first. Furthermore, the local derivative and the local correlation of the gray values are determined in a square window of four cells. For every text line one thus obtains a sequence of 80-dimensional feature vectors, which are computed per stripe in the direction of the text. This can be viewed as sliding a small analysis window, which has the width of one text stripe, along the text line image, a method that is referred to as the sliding window approach. The technique was first applied to the problem of offline character recognition by researchers at BBN [214]. It is fundamental for transforming text image data into a chronological sequence of feature vectors, which can later be fed into an HMM-based analysis system. Today the sliding window approach can be found in the majority of HMM-based offline recognizers.
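The stripe layout described above can be made concrete with a short sketch. It is only an illustration in Python, not BBN's implementation: the function names, the rounding of stripe width and step to whole pixels, and the decision to drop a trailing partial stripe are assumptions that the source does not specify.

```python
def stripe_bounds(line_width, line_height):
    """Return (start, end) pixel columns of the overlapping stripes of one text line."""
    width = max(1, round(line_height / 15))  # stripe width: 1/15 of the line height
    step = max(1, round(width / 3))          # two-thirds overlap: advance by one third
    bounds = []
    start = 0
    while start + width <= line_width:       # assumption: a trailing partial stripe is dropped
        bounds.append((start, start + width))
        start += step
    return bounds


def sliding_windows(image, line_height):
    """Yield each stripe of a text line image given as a list of pixel rows."""
    for start, end in stripe_bounds(len(image[0]), line_height):
        yield [row[start:end] for row in image]


# A blank line image, 90 pixels high and 20 pixels wide, as a toy input.
image = [[0] * 20 for _ in range(90)]
windows = list(sliding_windows(image, line_height=90))
print(len(windows), "stripes of width", len(windows[0][0]))
```

Each yielded stripe would then be divided into cells and turned into one feature vector, so that the sequence of stripes plays the role that the sequence of short-time frames plays in speech recognition.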

Modeling Machine Print

The statistical modeling of machine-printed text is performed on the basis of context-independent character models. Together with the models for punctuation symbols and white space, BBN uses 90 elementary models for English texts and 89 for Arabic. Each of these models has 14 states, which are connected according to the Bakis topology. Therefore, both a linear pass through the state sequence and the skipping of individual states are possible (see section 8.1, page 127). The output probability densities of the individual model states are defined on the basis of a shared set of component densities, similar to semi-continuous HMMs (see section 5.2, page 63). However, this is done separately for subsets of 10 features from the total 80-dimensional feature vector. Each of the eight partial output probability densities uses an inventory of 64 Gaussians. The total density for a feature vector is obtained by multiplying all eight partial density values.
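As a rough illustration of how the eight partial densities combine, the following Python sketch evaluates the output density of one state in the log domain, where the product of the partial densities becomes a sum. The use of diagonal covariances, the data layout, and all names are assumptions made for this example; the source only specifies eight subsets of 10 features with 64 Gaussians each.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (an assumption of this sketch)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_stream_density(x, weights, codebook):
    """log sum_k c_k N(x | mu_k, var_k) for one feature subset, via log-sum-exp."""
    terms = [math.log(c) + log_gauss_diag(x, mean, var)
             for c, (mean, var) in zip(weights, codebook) if c > 0]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def log_output_density(x, state_weights, codebooks, stream_dim=10):
    """Sum of the partial log densities, i.e. the log of their product.

    codebooks[s] is the Gaussian inventory shared by all states for subset s;
    state_weights[s] holds the state-specific mixture weights for that subset.
    """
    total = 0.0
    for s, (weights, codebook) in enumerate(zip(state_weights, codebooks)):
        subvector = x[s * stream_dim:(s + 1) * stream_dim]
        total += log_stream_density(subvector, weights, codebook)
    return total


# Toy configuration: two subsets of one feature each, two Gaussians per codebook.
codebooks = [[([0.0], [1.0]), ([0.0], [1.0])]] * 2
weights = [[0.5, 0.5]] * 2
print(log_output_density([0.0, 0.0], weights, codebooks, stream_dim=1))
```

Working with logarithms avoids numerical underflow, which would otherwise occur quickly when eight small density values are multiplied for every frame and state.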
Footnote 1: dpi is the abbreviation for dots per inch.


The models are trained by means of the Baum-Welch algorithm (see page 80), where, in the same way as in automatic speech recognition, only the orthographic transcription of every text line, i.e. the actual character sequence, is given. At the beginning of the training all HMM parameters are initialized uniformly. Which initial parameters are used for the eight mixture codebooks is, unfortunately, not documented in the literature. However, it may be assumed that initial codebooks are generated in an unsupervised manner by means of a method for vector quantizer design (see section 4.3, page 50).

Language Modeling and Search

Just as in automatic speech recognition, a certain lexicon can be given for OCR. The restriction of potential word sequences can then be achieved in the usual way by a word-based n-gram model. With this method and a lexicon of 30 000 words, BBN achieved a character-level error rate of less than 1% for English texts and of a little more than 2% for Arabic documents. However, if no constraints are to be imposed on the expected words, the recognition needs to be performed on the basis of characters alone. In this case an unlimited lexicon is said to be used. By applying an n-gram model on the level of characters, the missing restrictions of a fixed recognition lexicon can be compensated for to some extent. The achievable error rates, however, increase by a factor of 2 to 3 if merely a character tri-gram is used. The combination of HMMs for individual characters and an n-gram model for the restriction of the search space is achieved by a multi-pass search strategy in the BYBLOS system (see page 210).
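Character-level error rates such as those quoted above are commonly computed from the edit distance between the reference transcription and the recognizer output. The source does not state the exact scoring protocol used by BBN, so the following Python sketch shows only this common definition; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimal number of substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion of r
                            curr[j - 1] + 1,          # insertion of h
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the length of the reference transcription."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)


print(character_error_rate("recognition", "recogmition"))
```

Because insertions are counted as errors, a rate defined this way can exceed 100% for very poor recognition results.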

14.2 Online Handwriting Recognition System of the University of Duisburg


The online handwriting recognition system developed at the University of Duisburg, Germany, in the research group of Gerhard Rigoll (footnote 2) is a writer-dependent system for the recognition of isolated words. The writing style, however, is not restricted. As one of the very few systems in the field of handwriting or optical character recognition, it uses context-dependent sub-word units in order to achieve a modeling that is as precise as possible even for large vocabularies. The system details summarized in the following are taken from [126, 198] and [124].

Footnote 2: Gerhard Rigoll has meanwhile become head of the Institute for Human-Machine Communication at the Technical University of Munich, Germany.

Feature Extraction

In the Duisburg system the pen trajectories are captured using a graphics tablet by WACOM. Such devices typically provide measurements with a sampling rate of approximately 200 Hz. Besides the pen position, which is determined with a precision of 2540 lpi (footnote 3) and for an elevation of the pen of up to 5 mm above the tablet, the data also comprises the pen pressure in 256 discrete levels.

In order to compensate for variations in writing speed, which are highly person-specific, the raw data is first re-sampled. In that process the new samples are usually placed equidistantly along the pen trajectory. In the Duisburg system, however, the distance between the samples is optimized depending on the local trajectory parameters such that a sufficiently accurate resolution is ensured even for fast changes in the writing direction.

On the preprocessed trajectory data four types of features are then calculated: the orientation α of the vector connecting two successive pen positions, represented as sin α and cos α; the difference Δα of successive orientations, represented as sin Δα and cos Δα; the pen pressure; and a local representation of the region surrounding the current pen position in the form of a gray level image. For this so-called bitmap feature, the pen trajectory is first represented locally as a 30 × 30 binary image and then subsampled in a raster of 3 × 3 pixels. The nine gray values defined in this way are then used as additional features.

Modeling Handwriting

In the same way as in the system by BBN, context-independent models of characters form the basis of the statistical modeling of handwriting in the Duisburg system. For the German language 80 elementary models are used, which are defined as linear HMMs with 12 states for characters and 4 for punctuation symbols. In order to improve the precision of the representation, so-called trigraph models are used in addition. They are the character-based equivalent of triphones, i.e. character models in the context of the respective left and right neighboring symbols.
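To illustrate what a trigraph is, the following Python sketch expands a word into its sequence of context-dependent units. The boundary symbol "#", the tuple representation, and the function name are assumptions made for this example and are not taken from the Duisburg system.

```python
def to_trigraphs(word, boundary="#"):
    """Expand a word into (left context, character, right context) units."""
    padded = boundary + word + boundary
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


for left, center, right in to_trigraphs("haus"):
    print(f"{center} in the context of {left!r} and {right!r}")

# With 80 elementary models, the number of conceivable trigraphs is 80 ** 3.
print(80 ** 3)
```

Every character of the word yields exactly one unit, so the length of the model sequence does not change; only the number of distinct models grows.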
Since these models cannot be trained robustly with a potential inventory of 80^3 (512 000) trigraphs, two measures are taken. On the one hand, only those trigraphs are represented that occur sufficiently often in the training data. On the other hand, robust generalizations of parameters for unseen states are generated by automatically computing state clusters by means of decision trees (see section 9.2.2, page 157). For the modeling of emissions, discrete and continuous HMMs as well as a hybrid approach that incorporates neural networks are investigated. The best results for a recognition lexicon of 200 000 words are achieved by the hybrid system, which combines discrete HMMs with a vector quantization stage on the basis of neural networks. There the parameters of both parts of the model can be optimized jointly by a method for discriminative training.

Footnote 3: lpi is the abbreviation for lines per inch.

Search

As the Duisburg online recognition system is not used for the processing of word sequences, no n-gram language model is applied in the decoding stage. In order to make the search in the extremely large recognition lexica of up to 200 000 entries efficient, the necessary sub-word units are represented as a prefix tree (see section 10.4.1, page 174).
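The benefit of a prefix tree can be seen in a small sketch: words that start with the same units share the corresponding nodes, so common prefixes have to be evaluated only once during the search. As a simplification, the Python example below uses plain characters as units instead of HMM sub-word units, and the nested-dictionary layout and all names are assumptions of this illustration.

```python
END = "<end>"  # marker key for a complete lexicon entry; cannot collide with a single character


def build_prefix_tree(lexicon):
    """Build a prefix tree as nested dictionaries, one level per character."""
    root = {}
    for word in lexicon:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def count_nodes(node):
    """Count the character nodes of the tree, excluding end markers."""
    return sum(1 + count_nodes(child)
               for key, child in node.items() if key != END)


def contains(tree, word):
    """Check whether a word is a complete entry of the lexicon."""
    node = tree
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


lexicon = ["hand", "handle", "hard", "word"]
tree = build_prefix_tree(lexicon)
print(count_nodes(tree), "nodes for", sum(len(w) for w in lexicon), "characters")
```

In this toy lexicon the tree needs 12 nodes for 18 characters; for lexica with many shared prefixes the reduction in the number of model instances to be evaluated is considerably larger.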

14.3 ESMERALDA Offline Recognition System


In contrast to most approaches for offline handwriting recognition, which consider isolated phrases, e.g., in postal addresses, or only isolated words written in certain fields of a form, the goal of the system described in the following is the writer-independent recognition of complete handwritten texts. The principal approach is comparable to the OCR system by BBN. A detailed description of the methods used can be found in [246].

Preprocessing

As opposed to machine-printed texts, suitable preprocessing of optically captured handwriting data is of fundamental importance. Similar to the system by BBN, after a position normalization of the document page the individual text lines are first segmented by evaluating the horizontal projection histogram. However, in handwritten texts the baseline of the writing is in general not strictly horizontal, which is referred to as the skew of the line or the so-called baseline drift, and individual characters or words are usually not written completely upright but with some varying inclination with respect to the vertical, the so-called slant. Therefore, normalization operations try to compensate for these variabilities before the actual feature extraction. After a global correction of the line orientation, the skew is corrected locally together with the slant of the writing, so that variations within a line can also be captured approximately.

As the ESMERALDA offline recognizer is primarily intended for the recognition of texts in video data, a local binarization of the text line image is then performed. This ensures that intensity variations of both writing and background do not adversely affect the subsequent feature extraction process (footnote 4). As a final preprocessing step, the text line image is normalized in size. For this purpose, local extrema of the contour of the writing are determined first. Then the line image is re-scaled such that the average distance between these matches a predefined constant.
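The line segmentation step at the start of the preprocessing chain can be sketched as follows. This Python example is a simplified illustration and not the ESMERALDA implementation: it assumes an already binarized page (1 = ink) and treats rows whose ink count does not exceed a threshold as gaps between lines, whereas the source only states that the horizontal projection histogram is evaluated.

```python
def segment_lines(image, threshold=0):
    """Return (top, bottom) row ranges of text lines in a binary page image.

    image is a list of pixel rows with 1 for ink and 0 for background.
    Rows whose ink count is at most `threshold` are treated as line gaps.
    """
    profile = [sum(row) for row in image]  # horizontal projection histogram
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink > threshold and start is None:
            start = y                      # first row of a new text line
        elif ink <= threshold and start is not None:
            lines.append((start, y))       # line ends before the gap row
            start = None
    if start is not None:                  # page ends inside a text line
        lines.append((start, len(profile)))
    return lines


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
print(segment_lines(page))
```

For handwriting, ascenders and descenders of neighboring lines frequently touch or overlap, so in practice the minima of the histogram are less pronounced than in this toy page and a non-zero threshold or smoothing of the profile would be needed.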

Feature Extraction

Just as the OCR system by BBN, the ESMERALDA offline recognizer uses the sliding window technique to convert text line images into sequences of feature vectors. Pre-segmented and normalized text line images are subdivided into small stripes or analysis windows, which are four pixels wide and overlap each other by half. For each of these windows nine geometrical features are computed from the associated stripe of the binarized text image.

The first group of features describes the coarse shape of the writing within the local analysis window. It comprises the average distances of the lower baseline to both the upper and the lower contour of the writing, as well as the distance of the center of gravity of the text pixels to the baseline. These features are then normalized by the core size, i.e. the distance between upper and lower baseline, in order to increase the robustness against variations in the size of the writing. Furthermore, three local directional features are calculated, which describe the orientation of the lower and the upper contour as well as the gradient of the mean of the column-wise pixel distributions. Finally, the average number of black-to-white transitions per column, the average number of text pixels per column, and the average number of text pixels between upper and lower contour are calculated.

In order to consider a wider temporal context in the feature representation to some extent, the 9-dimensional baseline feature set is complemented by a discrete approximation of the temporal derivatives of the features, which is computed by linear regression over a context of five analysis windows.

Footnote 4: Informal experiments showed that the purely gray level-based features of the BBN system could not be applied successfully to the processing of handwriting data. For the purpose of OCR, however, those features immediately achieved a convincing system performance.

Handwriting Model

The statistical modeling of handwriting is performed on the basis of semi-continuous HMMs (see section 5.2, page 63) with a shared codebook of approximately 2 000 Gaussians with diagonal covariances. A total of 75 context-independent HMMs are created for modeling 52 letters, ten digits, twelve punctuation symbols, and white space. The number of model states is automatically determined depending on the length of the respective unit in the training material.
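Returning briefly to the feature extraction step: the derivative features computed by linear regression over five analysis windows can be sketched in Python as shown below. The slope of the least-squares line through five equally spaced values reduces to a weighted sum of the neighboring frames. Repeating the first and last frame at the borders of the line is an assumption of this sketch, as are all names; the source does not say how the borders are handled.

```python
def regression_deltas(frames, context=2):
    """Slope of the least-squares line over 2 * context + 1 successive frames.

    frames is a list of feature vectors (lists of floats), one per analysis window.
    """
    norm = sum(k * k for k in range(-context, context + 1))  # 10 for context = 2
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        delta = [0.0] * len(frames[t])
        for k in range(-context, context + 1):
            # Assumption: the first and last frame are repeated at the borders.
            neighbor = frames[min(max(t + k, 0), last)]
            for d, value in enumerate(neighbor):
                delta[d] += k * value
        deltas.append([v / norm for v in delta])
    return deltas


def append_deltas(frames, context=2):
    """Concatenate every feature vector with its derivative approximation."""
    return [f + d for f, d in zip(frames, regression_deltas(frames, context))]


# Toy sequence: two features growing linearly with slopes 1 and 2.
frames = [[float(t), 2.0 * t] for t in range(7)]
print(regression_deltas(frames)[3])
```

Applied to the nine baseline features, appending the derivatives in this way would double the dimension of each feature vector.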
All of these character models use the Bakis topology in order to capture a wider variability in the length of the character patterns described (see section 8.1, page 127).

Language Modeling and Search

In order to make handwriting recognition with an unlimited lexicon possible on the basis of character models alone, sequencing restrictions between HMMs for individual symbols are represented by means of character n-gram models of increasing complexity (footnote 5). For all models estimated, the raw n-gram probabilities were smoothed by applying absolute discounting and backing off (see pages 101 and 106). The integrated decoding of the HMMs describing the handwriting and the n-gram language models defining the sequencing restrictions is achieved in the ESMERALDA offline recognizer by applying the time-synchronous search method described in detail in section 12.4 on page 198.
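The smoothing scheme mentioned above can be illustrated for the simplest case of a character bi-gram model. In the Python sketch below, a constant beta is subtracted from every observed bi-gram count, and the probability mass gained in this way is redistributed to unseen successors in proportion to their uni-gram probabilities. The value of beta, the restriction to bi-grams, and all names are assumptions of this example; it is not the ESMERALDA implementation.

```python
from collections import Counter


def train_bigram(text, beta=0.5):
    """Return a function prob(y, z) estimating P(z | y) on the level of characters."""
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    history = Counter(text[:-1])  # how often each symbol occurs as a predecessor
    total = len(text)

    def prob(y, z):
        c_y = history[y]
        if c_y == 0:
            return unigrams[z] / total           # unseen history: use the uni-gram
        seen = {b for (a, b) in bigrams if a == y}
        if z in seen:
            return (bigrams[(y, z)] - beta) / c_y  # discounted relative frequency
        freed = beta * len(seen) / c_y           # probability mass gained by discounting
        unseen_mass = sum(c for w, c in unigrams.items() if w not in seen) / total
        if unseen_mass == 0:
            return 0.0                           # no known symbol is left to back off to
        return freed * (unigrams[z] / total) / unseen_mass

    return prob


prob = train_bigram("abababac")
print(prob("a", "b"), prob("a", "c"), prob("a", "a"))
```

In this example the bi-gram "aa" never occurs in the training text but still receives a non-zero probability, while the probabilities of all successors of "a" continue to sum to one.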

Footnote 5: In [246], lexicon-free experiments are reported for bi-gram up to 5-gram models.
