ABSTRACT
The Digital Age has brought with it large-scale digitization of
historical records. The modern scholar of history or of other
disciplines is often faced today with hundreds of thousands of
readily-available and potentially-relevant full or fragmentary
documents, but without computer aids that would make it possible to find the sought-after needles in the proverbial haystack
of online images. The problems are even more acute when
documents are handwritten, since optical character recognition
does not provide quality results.
We consider two tools: (1) a handwriting matching tool
that is used to join together fragments of the same scribe, and
(2) a paleographic classification tool that matches a given document to a large set of paleographic samples. Both tools are
carefully designed not only to provide a high level of accuracy,
but also to provide a clean and concise justification of the
inferred results. This last requirement engenders challenges,
such as sparsity of the representation, for which existing solutions are inappropriate for document analysis.
1. INTRODUCTION
Paleography, the study of old handwriting, is the discipline
that engages in the decipherment of ancient texts. In addition,
it tries to ascertain when and where a given manuscript was
written, and, if possible, by whom. Paleographers bring
many skills and tools to bear on these questions, in what is
often a complicated and laborious task, requiring reference to
paleographic, linguistic and archaeological data. Because it is
difficult to quantify the degree of certainty in the final readings
and assessments, experts have begun to develop computer-based methods for paleographic research. So far, such methods
have only been applied to small cases due to the high degree
of labor involved. Moreover, these efforts have focused almost
exclusively on scribal identity, and tend to use the computer as
a black box that receives images of manuscripts and replies
with a classification of the handwriting, which scholars may
be reluctant to accept.
In this work, we therefore study what we term computerized paleography, that is, digital tools that furnish the analysis
of a human paleographer with large-scale capabilities and assist with evidence-based inference. We explore two tasks:
handwriting matching and paleographic classification, focusing mainly on the Cairo Genizah. Due to the scattering of the
Genizah fragments in over 75 libraries, handwriting matching
is a fundamental task for Genizah scholars. These scholars
have expended a great deal of time and effort on manually
rejoining leaves of the same original book or pamphlet, and on
piecing together smaller fragments, often visiting numerous
libraries for this purpose.
Recently, a system was proposed [1] that automatically
identifies such potential joins, so that they may be verified by
human experts. While successful in finding new joins, that
system is a ranking algorithm that provides the expert with a
simple numeric matching score for every pair of documents.
The expert is left with the task of validating the join, without
being provided any insight as to the basis for the score. We
suggest that by using sparse representations and by avoiding
metric learning, the results we obtain are easily interpretable.
Existing sparse representations from the literature are shown to be ineffective for this specific task, and a new scheme is proposed.
The second task we consider is paleographic classification. We construct a paleographic tool that, given a fragment,
provides suitable candidates for matching writing styles and
dates. Such a tool can expedite the paleographic classification of fragments within the Genizah, and have long-reaching
implications beyond Hebrew texts.
Background. Recent approaches to writer identification, such as letter- or grapheme-based methods, tend to employ local features and textural feature matching [2, 3].
Early uses of image analysis and processing for paleographic
research are surveyed in [4]. Quantitative aspects can be measured by automated means and the results can be subjected to
computer analysis and to automated clustering techniques [5].
2. BASELINE IMAGE REPRESENTATION
The baseline method follows previous work [1, 6] and employs a general framework for image representation that has
been shown to excel in domains far removed from document
processing, based on a bag of visual keywords. The signature
of a leaf is based on descriptors collected from local patches
in its fragments, centered around key visual locations, called
keypoints. Such methods follow a common pipeline: first,
keypoints around the image are localized by examining the image locations that contain the most visual information. In our
case, the pixels of the letters themselves are good candidates
for keypoints, while the background pixels are less informative.
Next, the local appearance at each such location is encoded as
a vector. The entire image is represented by the obtained set of
vectors, which, in turn, is represented as a single vector. This
last encoding is based on obtaining a dictionary, containing
representative prototypes of visual keywords, and counting,
per image, the frequency of visual keywords that resemble
each prototype.
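The keypoint-localization step of this pipeline can be sketched concretely. The following minimal Python sketch (an illustration, not the system's actual code) treats each connected component of a binarized image as one keypoint and returns component centroids; the function name, the 4-connectivity choice, and the `min_area` parameter are assumptions:

```python
import numpy as np
from collections import deque

def connected_component_keypoints(binary, min_area=1):
    """Locate keypoints as centroids of connected components (4-connectivity).

    `binary` is a 2-D array whose foreground (ink) pixels are non-zero.
    Returns a list of (row, col) centroids, one per component.
    """
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    keypoints = []
    for i in range(h):
        for j in range(w):
            if binary[i, j] and not seen[i, j]:
                # flood-fill one component starting from this pixel
                queue = deque([(i, j)])
                seen[i, j] = True
                pixels = []
                while queue:
                    r, c = queue.popleft()
                    pixels.append((r, c))
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w and binary[rr, cc] and not seen[rr, cc]:
                            seen[rr, cc] = True
                            queue.append((rr, cc))
                if len(pixels) >= min_area:
                    ys, xs = zip(*pixels)
                    keypoints.append((sum(ys) / len(ys), sum(xs) / len(xs)))
    return keypoints

# Two separated "letters" yield two keypoints.
img = np.zeros((5, 7), dtype=int)
img[1:3, 1:3] = 1   # first component
img[1:4, 5] = 1     # second component
print(connected_component_keypoints(img))  # [(1.5, 1.5), (2.0, 5.0)]
```

This exploits the property, noted below, that letters in Hebrew writing are usually separated, so each connected component roughly corresponds to one letter.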
In [1], it was suggested to detect the image keypoints using
connected components, since, in Hebrew writing, letters are
usually separated. Each keypoint is described by the popular
SIFT descriptor [7]. To construct a dictionary, keypoints are
detected in a set of documents set aside, and a large collection
of 100,000 descriptors is subsampled. These are then clustered
by the k-means algorithm to obtain dictionaries of varying sizes. Given a dictionary, a histogram-based method is employed to encode each manuscript leaf as a vector: for each cluster-center in the dictionary, the number of leaf descriptors (in the encoded image) closest to it is counted. The result is a
histogram of the descriptors in the encoded leaf with as many
bins as the size of the dictionary. Finally, the histogram vector
is normalized to sum to 1.
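The histogram-encoding step can be illustrated with a minimal sketch, assuming a dictionary already produced by k-means; the two-dimensional toy descriptors stand in for 128-dimensional SIFT vectors, and `encode_leaf` is a hypothetical name:

```python
import numpy as np

def encode_leaf(descriptors, dictionary):
    """Bag-of-visual-keywords encoding: nearest-prototype histogram,
    normalized to sum to 1."""
    # squared distances between every descriptor and every prototype
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                      # hard assignment
    hist = np.bincount(nearest, minlength=len(dictionary)).astype(float)
    return hist / hist.sum()                         # L1 normalization

dictionary = np.array([[0.0, 0.0], [10.0, 10.0]])    # two toy prototypes
descriptors = np.array([[0.5, 0.2], [9.8, 9.9], [10.1, 10.0], [0.1, 0.3]])
print(encode_leaf(descriptors, dictionary))          # [0.5 0.5]
```

The resulting vector has as many entries as dictionary prototypes, matching the description above.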
3. SPARSE CODING OF HANDWRITINGS
We wish to present evidence to the user regarding the similarity of two documents in a concise and intuitive way. This
evidence should therefore be much more compact than the
size of the dictionary, which typically contains hundreds, if
not thousands, of prototypes. This suggests the use of a sparse
image representation.
Recently, several advances have been made in the bag-of-visual-keywords approach. An efficient sparse coding
technique [8] is obtained by requiring that the solutions be
local, i.e. that coefficients are nearly 0 for all dictionary items
that are not in the vicinity of the descriptor. This is enforced,
for example, by adding weights in an L2-minimization setting.
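A simplified illustration of such locality-constrained coding follows. Unlike the full formulation of [8], which includes a sum-to-one constraint, this sketch merely solves a small ridge-regularized least-squares problem over the k nearest dictionary atoms and forces all other coefficients to be exactly zero; the function name and the `lam` parameter are assumptions:

```python
import numpy as np

def local_code(x, dictionary, k=2, lam=1e-4):
    """Locality-style sparse code for descriptor x: reconstruct x using only
    its k nearest dictionary atoms; all other coefficients stay at zero."""
    d2 = ((dictionary - x) ** 2).sum(axis=1)
    idx = np.argsort(d2)[:k]                 # the k local atoms
    B = dictionary[idx]                      # k x d sub-dictionary
    # ridge-regularized least squares: (B B^T + lam I) c = B x
    G = B @ B.T + lam * np.eye(k)
    c = np.linalg.solve(G, B @ x)
    code = np.zeros(len(dictionary))
    code[idx] = c
    return code

D = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
code = local_code(np.array([0.9, 0.1]), D, k=2)
print(code.round(3))   # only the two nearby atoms receive non-zero weight
```

The locality requirement is visible in the output: the distant third atom gets a coefficient of exactly zero.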
While such sparse-coding methods outperform the standard bag-of-feature techniques for object recognition tasks,
they do not seem to perform well on Genizah fragments, for either join finding or paleographic classification. We
hypothesize that the reason is that these methods were designed as hashing schemes, where recognition is obtained by
detecting the footprint of unique and distinctive descriptors
in the vector representing the image. In historical documents,
however, the distinguishing descriptors repeat multiple times
with small, yet significant, variations between documents.
Moreover, our interest in sparseness arises from the need to
have a succinct representation for the entire document (or pair
of documents). In both sparse and locality coding, individual keypoints are represented by sparse vectors; however, the pooled representation of the entire document is not.
Fig. 1. Example of charts produced for pairs of matching documents (a,b) and a non-matching false-positive pair (c). Every two
rows constitute a block that corresponds to one specific dictionary
prototype. Each block segment depicts the cluster center (top-left
corner), one row of keypoint examples from one document of the
pair, and one row of examples from the second document. Pairs of
matching keypoints (one per document) are placed one on top of the
other, with the best matching pair on the left. Shown are the top 7
contributing blocks for each document, sorted from top to bottom.
The total number of blocks is 25, 22, and 18 for the examples shown
in (a), (b), and (c) respectively.
Paleographic classification. For the paleographic classification task, we use sample documents and letters for each
of 365 script types. These samples are extracted from the
pages of the medieval Hebrew script specimen volumes [10], which contain many example manuscripts whose provenance is known, and which serve as an important tool in contemporary Hebrew paleography.
The dictionaries for the paleographic classification tool are
constructed out of the extracted sample letters. As a result, the
prototypes are constructed from letters that are much cleaner
than the connected components that are extracted from the documents. The dictionary construction in this case follows two
steps. First, k-means is applied to all professionally-drawn
sample letters from the specimen volumes, resulting in 600 clusters. The center of each cluster is selected as a prototype. Second, the keypoints of each manuscript page from
the specimen volumes are localized by extracting connected
components of the binary images. The keypoints are then
assigned to the prototypes by means of descriptor similarity.
Lastly, prototypes for which the assigned keypoints (from the
manuscripts) greatly differ from the clustered samples of the
professionally-drawn letters are discarded.
The numerical criterion is as follows: first the center of all
assigned manuscript keypoints is computed by taking the mean
vector of all associated descriptors. Next, this new center is
compared to the original cluster center. Clusters for which the
two vectors differ considerably are discarded. Figure 2 shows
the initial dictionary obtained and the results of the filtering
step. As can be seen, ambiguous clusters that do not correspond to actual letters are discarded by this process.
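The filtering criterion can be sketched in a few lines; `filter_prototypes` is a hypothetical helper, and the distance threshold is an assumed free parameter:

```python
import numpy as np

def filter_prototypes(centers, descriptors, threshold):
    """Keep only prototypes whose assigned manuscript descriptors have a
    mean close to the original (letter-derived) cluster center."""
    # assign every manuscript descriptor to its nearest prototype
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    keep = []
    for c in range(len(centers)):
        assigned = descriptors[assign == c]
        if len(assigned) == 0:
            continue  # no support in the manuscripts: discard
        # compare the mean assigned descriptor to the original center
        if np.linalg.norm(assigned.mean(axis=0) - centers[c]) <= threshold:
            keep.append(c)
    return keep

centers = np.array([[0.0, 0.0], [4.0, 4.0]])
# descriptors near the first center, plus descriptors assigned to the
# second center whose mean is biased far away from it
descs = np.array([[0.1, 0.0], [-0.1, 0.1], [5.5, 5.6], [5.4, 5.9]])
print(filter_prototypes(centers, descs, threshold=1.0))  # [0]
```

In this toy run the second prototype is dropped, mirroring how ambiguous clusters are discarded.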
Table 1. Results obtained on the join-finding benchmark (all values in percent).

Method                                EER    AUC    tpr@10^-4 fpr   Accuracy
Baseline method w/ OSS learning [1]   9.2    93.7   76.0            86.8
LLC [8]                               17.3   86.7   40.9            96.2
Sparse document coding (ours)         8.4    93.3   64.7            98.9
Multiple + physical + subject [6]     4.3    96.8   84.5            95.6
5. EXPERIMENTS
To evaluate the quality of our join-finding efforts, we use the
comprehensive benchmark consisting of 31,315 leaves [1].
The benchmark comprises 10 equally sized sets, each containing 1000 positive pairs of images taken from the same
joins and 2000 negative (non-join) pairs. Care is taken so that
no known join appears in more than one set, and so that the
number of positive pairs taken from one join does not exceed
20. To report results, the classification process is repeated 10 times. In each iteration, 9 sets are taken as training, and the results are evaluated on the 10th set. Results are reported by constructing an ROC curve for all splits together, computing statistics of that curve (area under curve, equal error rate, and true positive rate at a low false positive rate of 0.001), and recording average recognition rates for the 10 splits.
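These statistics can be computed directly from raw matching scores. The sketch below (a simplified illustration that ignores tied scores, not the benchmark's reference implementation) derives AUC, EER, and the true positive rate at a fixed low false positive rate:

```python
import numpy as np

def roc_stats(scores, labels, target_fpr=0.001):
    """Compute AUC, EER, and the true positive rate at a fixed low false
    positive rate from matching scores (label 1 = join pair, 0 = non-join)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)              # sort pairs by descending score
    labels = labels[order]
    tp = np.cumsum(labels)                   # true positives at each cutoff
    fp = np.cumsum(1 - labels)               # false positives at each cutoff
    tpr = tp / tp[-1]
    fpr = fp / fp[-1]
    # area under the ROC curve via the trapezoid rule, from the origin
    fpr_ = np.concatenate(([0.0], fpr))
    tpr_ = np.concatenate(([0.0], tpr))
    auc = float(np.sum(np.diff(fpr_) * (tpr_[1:] + tpr_[:-1]) / 2))
    # equal error rate: operating point where fpr is closest to 1 - tpr
    i = int(np.argmin(np.abs(fpr - (1 - tpr))))
    eer = float((fpr[i] + 1 - tpr[i]) / 2)
    below = fpr <= target_fpr
    tpr_at = float(tpr[below].max()) if below.any() else 0.0
    return auc, eer, tpr_at

# A perfectly separating scorer attains AUC 1.0, EER 0.0, and full tpr.
print(roc_stats([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # (1.0, 0.0, 1.0)
```

In practice, one would pool the scores of all ten splits before calling such a routine, matching the all-splits-together ROC described above.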
Table 1 summarizes the obtained benchmark results. As can be seen, our sparse document coding method, which is aimed at providing transparent (justifiable) scores, performs similarly to the best baseline method, which employs the OSS metric-learning technique. It also considerably outperforms the LLC [8] sparse-coding approach. Moreover, LLC vectors are not sparse (see Section 3): 60% of the coefficients in each vector are non-zero.
Our method, which is based solely on handwriting, is, as
can be expected, not as good as the state-of-the-art method,
which combines multiple dictionaries and physical measurements together with catalog-based subject classification information. Note, however, that these auxiliary sources of information
could be combined with our method as well.
Paleographic classification. To evaluate the performance
of our paleographic classification tools, we collect a sample of
500 Genizah documents of varying script types and apply the
tool to them. It was shown in [6] that unsupervised clustering
is able to group such documents into (eighteen) clusters that
are relatively pure with regard to a coarse paleographic classification. Here, the focus is on the supervised task of matching
documents to an annotated gallery of examples.
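Matching a query document against such an annotated gallery reduces to ranking gallery signatures by similarity. In this hypothetical sketch, histogram intersection between bag-of-visual-keywords vectors stands in for whichever similarity measure the actual system uses:

```python
import numpy as np

def rank_gallery(query_hist, gallery_hists):
    """Rank annotated gallery documents by similarity to a query histogram.

    Similarity here is histogram intersection (sum of element-wise minima),
    an assumed stand-in for the system's actual similarity measure."""
    sims = np.minimum(query_hist[None, :], gallery_hists).sum(axis=1)
    return np.argsort(-sims)        # indices of gallery items, best first

gallery = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4]])
query = np.array([0.6, 0.3, 0.1])
print(rank_gallery(query, gallery))   # [0 2 1]
```

The rank of the correct script type in such an ordering is exactly what the 1st/2nd/3rd-match evaluation below records.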
For each Genizah document of the above-mentioned sample, we retrieve the most similar documents among those collected in [10]. We use both the baseline representation from [6] and the proposed sparse document coding representation.
Method / ranking          1st match   2nd match   3rd match   other
Baseline method           68%         27%         3%          2%
Sparse document coding    79%         16%         3%          2%

Table 2. Results obtained for the baseline method and for the proposed sparse document coding method. The five highest-ranking results are obtained by each method, and the average percent of correct matches per rank is recorded.