
Zator Technical Bulletin

Number 48

THE THEORY OF DIGITAL HANDLING OF NON-NUMERICAL INFORMATION
AND ITS IMPLICATIONS TO MACHINE ECONOMICS

by

Calvin N. Mooers

Copyright 1950 Zator Company
79 Milk Street, Boston 9, Mass.

THE THEORY OF DIGITAL HANDLING OF NON-NUMERICAL INFORMATION
AND ITS IMPLICATIONS TO MACHINE ECONOMICS [1]

Calvin N. Mooers
Zator Company [2]

The problem considered is the recall from storage of items
of non-numerical information. An example is the library problem for
the selection of technical abstracts by subject specification from a
listing of such abstracts. There are now several digital machine
methods for dealing with this important problem, and the comparative
success and machine complexity of each is intimately connected with the
principle of digital coding employed. Each information item must be
characterized for selection by a set of descriptive terms or
"descriptors". Independence of the descriptors in the selective process
of an item is most important. The different methods can be distinguished
by the manner in which they face or dodge the descriptor problem. The
systems now in use and which are considered are (1) alphabetical sorting,
(2) numerical code with sorting, (3) Dewey decimal coding, (4) method of
exclusive subfields, (5) unit card system, (6) Microfilm Rapid Selector
coding (Department of Agriculture system), (7) Microfilm Rapid Selector
coding (AEC revision), and (8) superimposed random coding (Zatocoding).
The principles of each method are sketched and the implications of the
coding with respect to efficient searching and economical machine cost
are examined.
I      Introduction
II     The Foundations
III    The Alphabetical Index
IV     Numerical Code and Sorting
V      Dewey Decimal Classification
VI     Method of Exclusive Subfields
VII    Unit Card System
VIII   Microfilm Rapid Selector
IX     Atomic Energy Commission Joint Project
X      Zatocoding
XI     Epilog

I - INTRODUCTION
The problem under discussion here is machine searching and
retrieval of information from storage according to specification by
subject. An example is the library problem of selection of technical
abstracts from a listing of such abstracts. It should not be necessary
to dwell upon the importance of information retrieval before a
scientific group such as this, for all of us have known frustration
from the operation of our libraries, all libraries, without exception.

[1] A paper presented before the Association for Computing Machinery at
    their Rutgers University Conference on March 29, 1950.
[2] 79 Milk Street, Boston 9, Massachusetts.

Unlike the problem of "random access" to mathematical tables
where the location of the desired tabular information is known in advance
once the independent variable is specified, in the library retrieval
situation the position and the very existence of the desired information
must be discovered.
The retrieval problem is a digital problem, for in fact all
human communication is digital. Information retrieval is a non-numerical
problem in part because most human communication is verbal, but
more important because most ideas or concepts cannot be mapped into
a Euclidean 3-space, or higher space. While there are scale readings
for the representation of some information, these are relatively few and
unimportant. Spatial and metrical concepts do not apply to most
information, at least not at the simpler levels. Yet, though the
information retrieval problem is non-numerical, there does not seem to
be any alternative to the use of digital techniques for its solution.
Digital information retrieval systems employing machines are already
operating, and their degree of success seems to indicate that this is
the direction of progress.
The intent of this paper is to bring to the attention of
members of the Association (1) that objective and scientific analysis
and design can be applied to information retrieval problems, (2) that
some very worth-while answers are already available, and (3) that the
responsibility for the future development of this important field will
fall upon the members of this Association.
II - THE FOUNDATIONS
In spite of the poor record to date, information retrieval
can be treated scientifically when the problem is accurately stated and
the applicable parameters are defined and used. Unfortunately, most of
the reasoning applying to the mathematical operation of computing
machines does not apply.
The closest analogy to our problem is that of looking up
numerical values in a table. However, the retrieval requirements
transform this familiar operation into a distorted situation where one
might think of a table in which some values are repeated, some are left
out, and many inconsistent values are given for the same argument. There
is no indication where any of the values are located, but there is the
requirement that every value must be found.
An information retrieval system must be evaluated by at least
two criteria: (1) can the system do the job, and (2) how expensive is
the solution in terms of the machine, storage capacity or time? It will
be seen that most systems are actually incapable of information
retrieval, a conclusion that has been reached empirically by many
scientists, but which does not seem to have been seriously realized by
those charged with organizing information, who currently are the
librarians. When the system can do the job, the next question concerns
its economics.
It is necessary at this point to make a request of my audience:
In order to approach this slippery problem with any hope of success or
efficiency of thought, it will be necessary for us to put aside almost
all the ideas, doctrines, and symbolic or metaphysical superstructure
about libraries and library methods that we have learned or otherwise
picked up in the past. It can be said and demonstrated that almost
everything that the librarians hold dear in classification is absolutely
wrong for information retrieval. It is my hope to develop the details
of this assertion in subsequent papers. However, for the moment, let
us put aside all preconceptions and examine the consequences.
What sort of a precise foundation can we put under the study
of information retrieval? To this question, a useful sequence of thought
and argument might be sketched in the following fashion: An item of
information will be considered to be a single report, scientific paper,
or unitary piece of data. In a moment we will see that it is impossible
to achieve retrieval by any system of ordering of the physical items
themselves; and the logical obverse of this fact is that retrieval
requires either some form of scanning in entirety of all information
items in the collection, or some manner of dealing with symbolic
abbreviations of the content of the items. Scanning in entirety is
humanly impracticable and technically undesirable. Therefore, attention
must be directed to methods of symbolic description.
Let there be an artificial language without synonyms whose
vocabulary is a fixed list of statements or ideas, which we shall call
alternatively "attributes" or "descriptors". In the simplest case, this
language is given no further algebra (or internal topology or grammar
if you will) than the logical product [3] of the descriptors. More
complicated algebras have been used quite successfully in chemistry, but
it is possible to go very far with this simple algebra. Upon looking at
an apple, one could apply the descriptors "fruit" and "red". Each is
applied separately, and the apple, being a red fruit, is characterized
by this logical product which describes those things that are both
red and a fruit.

[3] George Boole. The mathematical analysis of logic. Cambridge, 1847.

Any information item can have a set of assertions made about
it in terms of descriptors chosen from the vocabulary. Thus, the item
can be represented by a complex of these descriptors. There can be only
a finite number of descriptors in this complex. Where the set $((a_i))$
is the vocabulary, then the $j$-th information item can be symbolically
represented by $C_j(a_1, a_2, a_3, \ldots, a_n)$.
The requirements of information retrieval, of finding
information whose location or very existence is a priori unknown, now
require that it be possible by some efficient technique to specify a
selection of complexes $C_j$ by means of any set or combination of
descriptors chosen in any way from the vocabulary $((a_i))$. There must
be complete independence in the choice and use of descriptors.
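The requirement just stated can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item names, the descriptors, and the helper `select` are all hypothetical, and each complex is modelled as a plain set. A request is treated as a logical product, so an item is selected only when every requested descriptor applies to it.

```python
# Hypothetical toy collection: each item is a "complex" of descriptors.
collection = {
    "item-1": {"fruit", "red", "edible"},
    "item-2": {"fruit", "yellow", "edible"},
    "item-3": {"paint", "red"},
}

def select(collection, request):
    """Return the items whose complex contains every requested descriptor.

    The request is a logical product: all of its descriptors must apply.
    """
    wanted = set(request)
    return sorted(name for name, descriptors in collection.items()
                  if wanted <= descriptors)

# The order in which descriptors are named makes no difference.
print(select(collection, ["red", "fruit"]))
print(select(collection, ["fruit", "red"]))
```

Because the request is a set, any combination of descriptors from the vocabulary can be used, in any order, which is the independence the text asks for.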
This is the information retrieval problem as the user of
information sees it. This is how the user insists upon specifying his
information. Unfortunately, the basic organization and working philosophy
of our libraries is concerned with putting away information, its
listing, shelving and storage, with very little thought to use. Such a
philosophy is incompatible with the requirements of information retrieval
as I have stated them here, and we shall see why this is so in the study
of library systems which follows.
The different methods proposed and in use for information
retrieval can be searchingly criticised in accord with the manner in
which they meet or dodge the paramount necessity of complete freedom and
independence in the use and choice of descriptors.
We shall now restrict our field by requiring (1) that a system
be capable of dealing with collections of information that may become
very large (certainly larger than 10 items), (2) that, in general,
at least five attributes or descriptors are necessary for adequate
description of the item, (3) that a request for retrieval will take no
less than three descriptors operating in conjunction, and (4) that the
system is in some way dependent upon a machine in its operation. For the
sake of standard nomenclature, we will say that a card (or cards) is
assigned to each item of information; that the card bears the citation,
"address", or location of the actual document; and that the card also
carries in some manner a symbolic designation of the descriptors
applicable to the item. In our study, with emphasis upon computing
machines and allied techniques, these symbolic designations for selection
will be digital, thus allowing machine selection. Techniques which can be
carried out on cards can certainly be extended to photographic film
strips, magnetic tapes, or other memory or storage devices. In general I
will not discuss such extensions. With these ground rules, we shall now
look over some current methods and proposals for information retrieval.
III - THE ALPHABETICAL INDEX
The standard alphabetical index is one of the simplest methods
for information retrieval. A card bears in ordinary language the digital
verbal statement of the applicable descriptors, i.e. a list of written
words. As has been formulated here, the usual problem of synonyms has
been eliminated by the use of the standard vocabulary of attributes.
Tabulating machines can be used to sort the cards into a unique
alphabetical order providing the cards are punched corresponding to the
descriptions. Retrieval presumably consists in going to the one place in
the linearly ordered alphabetical file (according to the descriptors
specifying the selection) and there finding directly the cards bearing
these descriptors.
Such ease of finding is virtually never the case. It would
require, for an item described by five descriptors, that there would be
factorial five (120) cards in the file, each card with the descriptors
listed in a different order. Such ridiculous multiplicity of cards would
be intolerable, as any library user could testify. Consequently all
these combinations are never formed. Familiar cross-references are
necessary, and there is a serious loss of utility in retrieval.
Therefore, I preclude any system having a maze of cross-references as
being incapable of handling the multiple descriptor situation.
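The figure of 120 cards can be checked by enumerating the orderings directly. The five descriptors below are hypothetical, chosen only for illustration. The count depends only on how many descriptors there are.

```python
from itertools import permutations
from math import factorial

# Five hypothetical descriptors applied to one item.
descriptors = ["alloy", "corrosion", "marine", "steel", "welding"]

# Each ordering of the descriptors is a distinct heading, hence a
# distinct filing position in the alphabetical file.
headings = [" / ".join(order) for order in permutations(descriptors)]

print(len(headings))   # number of cards needed for this single item
print(headings[0])     # first heading in the generated list
```

The same count applies to every five-descriptor item in the collection, so the file would grow by a factor of 120 before any cross-references are considered.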
It can be definitely said that the alphabetical index (either
on cards or listed in a book) does not meet the fundamental requirement
of a complete independence in the use of descriptors in retrieval, and
it can never meet this requirement without inordinate multiplication of
the storage requirements, or by the use of an unacceptable system of
cross-referencing. For this reason the alphabetical index is incapable
of information retrieval in the sense here under discussion.
IV - NUMERICAL CODE AND SORTING
The patent office classes and subclasses [4] seem to be one of the
best examples of this technique, though it is not fully expanded so as
to differentiate down to a single patent. The Dysonian system [5] of
ciphering organic chemical compounds also seems to fall in this category.
For the purpose of the study here, this method can be thought of as the
alphabetical technique described above, with a translation of letters
into numbers. I know of no system in this category which actually
displays an independent use of descriptors.

[4] Manual of Classification of Patents, 1947. U.S. Department of Commerce.
[5] G. Malcolm Dyson. A new notation and enumeration system for organic
    compounds. (2nd ed.) London, Longmans Green. 1949.

V - DEWEY DECIMAL CLASSIFICATION
This system of numerical classification was devised by Melvil
Dewey in 1873, and now is in widespread use for classification of
library books in the United States. On the Continent an expanded version
of the Dewey system (more decimal places) is known as the Universal
Decimal Classification. The first empirical comment that can be made
of the Dewey system is that the librarians who use it for putting away
books on the library shelves never themselves use the Dewey decimals
directly for information retrieval. If this statement provokes any
disagreement, I suggest that you try asking a librarian to bring out
to you all books in class 512.2 or any other class, or that you
actually try to use the classification schedule, and nothing more, to
find your information. In fact, the librarians themselves have indexed
the Dewey schedule so they can find the subjects listed in it!
Far more pertinent to our study here is the theoretical basis
of this decimal classification. While there is an elaborate dogma of
postulates, I believe we can quickly cut through to the core of the
matter by the following reasoning. A basic assumption of the system
is that each information item in the universe can be mapped into a
single (and it is believed unique) point on the real line interval from
0 to 1. This is the librarian's ideal of "pin-pointing" the information.
It is further believed, by proper attention to the construction of the
classification schedule, that this mapping can be made topologically
continuous. What this means is that about any point on the real line
interval there is a neighborhood, or small segment of the line, in
which all points are associated with the information items having a close
conceptual similarity to each other. This mapping of ideas onto the
line "groups" the ideas, or classifies them, according to the beliefs
of the proponents of the Dewey decimal system. Moreover, they believe
that the mapping is such that the neighborhood about a given point
(with the neighborhood assumed to be connected, and not broken into
segments) must contain all the conceptually similar points, with
emphasis on the "all".
From ordinary experience with libraries organized by the
Dewey system, we know that these beliefs do not correspond to fact.
The mapping of idea complexes onto the real line interval is not
topologically continuous in a library. Books written in German, for
instance, are scattered throughout the shelves. This inability to
set up such a mapping is not due to any lack of skill or patience or
lack of revision of the classification schedules. It is due to a
fundamental property of information itself as compared to the decimal
technique.
The same difficulty that precludes the mathematical definition
of a continuous transformation from a space of two or more dimensions
onto a single-dimensional line element [6] precludes the attainment of
the Dewey decimal idea. Idea complexes, in so far as they contain more
than one independent descriptor, are multi-dimensional. The Dewey
transformation from idea space onto the one-dimensional 0-1 interval is
logically impossible.
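The scattering effect can be seen in a toy case. The sketch below is an illustration of one particular shelf order, not a proof of the general statement. The subjects, languages, and helper names are hypothetical. Items carry two independent descriptors and are laid out on a single line by sorting on subject first.

```python
from itertools import product

subjects = ["chemistry", "mathematics", "physics"]   # hypothetical
languages = ["English", "French", "German"]          # hypothetical

# One linear shelf order: sort by subject first, then by language.
shelf = sorted(product(subjects, languages))

def positions(shelf, index, value):
    """Shelf positions of the items whose descriptor at `index` is `value`."""
    return [i for i, item in enumerate(shelf) if item[index] == value]

def is_connected(places):
    """True when the positions form one unbroken segment of the line."""
    return places == list(range(places[0], places[0] + len(places)))

physics = positions(shelf, 0, "physics")
german = positions(shelf, 1, "German")
print(physics, is_connected(physics))
print(german, is_connected(german))
```

Sorting on language first would reverse the roles: the German books would sit together and each subject would be scattered. In this toy case no single ordering keeps both descriptors in connected neighborhoods.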
The postulates of the Dewey system are incompatible among
themselves, and the system can never be readjusted so as to perform the
task set for it. Practically, and from the retrieval standpoint, the
Dewey classification does and must scatter widely the information
bearing on an arbitrary idea complex. Therefore, since it misses its only
goal, it is really incapable of information retrieval as we have
formulated it.
The amazing thing about the continued usage and growth of
the Dewey decimal system, and the U.D.C., is the tyranny that it has
exerted over scientists who really should have known better. For at
least thirty or forty years mathematicians have had the critical tools
available that would have demolished the Dewey postulates, and we can
only wonder why it has not been done before this time.
VI - METHOD OF EXCLUSIVE SUBFIELDS
This is the descriptive name of the most prevalent method of
information retrieval using punched cards [7]. It is a tantalizing method
in that it almost makes the grade to give a very efficient solution to
the problem. Near-success of this method has often been mistaken for
real success, and therefore this method has attracted a great deal of
attention as a competent punched card method for solution of the retrieval
problem.
By the method of exclusive subfields, each information item
or report is given one punched card. In selection the whole collection
is scanned by some mechanical device. The digital coding area of
each card is partitioned into a standard set of subfields, each of
which can contain the digital representation of a single descriptor.
The fundamental difficulty of this system is due to the
indeterminacy in allocation or placement of descriptors among the
various exclusive subfields. Either there must be a standard placement
for the different classes of descriptors, or a machine selection
device must be capable of trying all possible subfield locations.
Such a machine is necessarily complicated and wasteful. Alternatively,
if the descriptors are placed in a standard location, there arises a
subtle rigidity that has in general been little understood, though the
effects have been deplored. These effects have led to the general
opinion that any punch card cannot carry enough punches [8], because it
has always been impossible to organize a system using punched cards
which could actually handle the full range of information of a large
file.

[6] L. E. J. Brouwer. Beweis der Invarianz der Dimensionenzahl.
    Math. Ann. vol. 70 (1911) pp. 161-165.
[7] For citations refer to: Lorna Ferris, Kanardy Taylor, and J. W. Perry,
    Bibliography on the uses of punched cards, procurable from the
    American Chemical Society.

The effect is real, and it is serious. For a complete discussion
of this indeterminacy of subfields, reference is made to the
author's earlier paper [9].
Approximate solutions to this problem, when attempted,
invariably lead to restrictions upon the use of descriptors; they can no
longer be used in an independent fashion.
Therefore, the punched card system with mutually exclusive
subfields is not found to meet the requirements of information retrieval
as we have formulated them.
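The two horns of the subfield dilemma can be sketched as follows. Everything here is a hypothetical illustration: the three-subfield layout, the descriptor classes, and both helper functions are assumptions made for the sketch. The second function shows only one way the rigidity of standard placement might show up, namely two descriptors competing for the same subfield.

```python
def matches_any_position(card, request):
    """Free placement: each requested descriptor must be looked for in
    every subfield, since it may have been punched in any of them.
    Returns (selected, number_of_comparisons_made)."""
    comparisons = 0
    for wanted in request:
        found = False
        for subfield in card:
            comparisons += 1
            if subfield == wanted:
                found = True
                break
        if not found:
            return False, comparisons
    return True, comparisons

def place_by_class(descriptors, layout):
    """Standard placement: one subfield per descriptor class.
    Raises ValueError when two descriptors need the same subfield."""
    card = [None] * len(layout)
    for cls, value in descriptors:
        slot = layout.index(cls)
        if card[slot] is not None:
            raise ValueError(f"subfield for class {cls!r} is already occupied")
        card[slot] = value
    return card

# Free placement: the selector has to try the subfields one after another.
free_card = ["steel", "welding", "marine"]
print(matches_any_position(free_card, ["marine", "steel"]))

# Standard placement: a second descriptor of the same class cannot be coded.
layout = ["material", "process", "environment"]
print(place_by_class([("material", "steel"), ("process", "welding")], layout))
```

Under the standard layout an item about both steel and copper has no legal coding, so the descriptors are no longer free to be combined independently.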
VII - UNIT CARD SYSTEM
This ingenious technique has a solution to the problem of
indeterminacy of subfields, but it achieves this goal at the cost of
a separate card for every descriptor of every item in the collection.
The unit card system can first be criticised for the excessive demands
it makes on the storage system. Experiments with this method, using
office tabulating machinery, have been conducted at the U. S. Patent
Office, the Chemical-Biological Coordination Center, and at other
places.
Each card has two symbols: a document number punched at one
end, and a single descriptor punched at the other end. There are as
many cards for each item as there are applicable descriptors. A
patent specification might take as many as 20 to 50 cards for as many
descriptors. The cards are alphabetized into serial order according
to the descriptors and the document number, and are then stored in this
order.

[8] J. E. Holmstrom. The Royal Society Scientific Information Conference,
    London. 1948. p. 264.
[9] C. N. Mooers. Zatocoding for punched cards. Zator Technical
    Bulletin No. 30. Zator Company, Boston. 1950.

Selection is made upon a concurrence of descriptors such as
"a, b, and c". To do so, one goes to the file under descriptor "a",
takes out the cards, doing in turn the same for "b" and "c". There
will be many cards in each collection. The cards for descriptors "a"
and "b" are then placed in the two feed positions of a collating
machine, and coincidences between the document numbers are looked for.
Pairs of cards having coincidences are set aside, and these in turn
are then collated against the cards of the collection "c". Cards
having triple coincidences represent the desired documents.
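The collation procedure can be sketched as a merge over decks sorted by document number. This is a minimal illustration, assuming each unit card is a (document number, descriptor) pair. The sample cards and the helper names `build_file` and `collate` are hypothetical.

```python
def build_file(unit_cards):
    """Group unit cards by descriptor, each deck sorted by document number."""
    decks = {}
    for document, descriptor in unit_cards:
        decks.setdefault(descriptor, []).append(document)
    return {descriptor: sorted(docs) for descriptor, docs in decks.items()}

def collate(deck_a, deck_b):
    """Step through two sorted decks together and return the document
    numbers present in both (the "coincidences")."""
    i = j = 0
    hits = []
    while i < len(deck_a) and j < len(deck_b):
        if deck_a[i] == deck_b[j]:
            hits.append(deck_a[i])
            i += 1
            j += 1
        elif deck_a[i] < deck_b[j]:
            i += 1
        else:
            j += 1
    return hits

# Hypothetical unit cards: (document number, descriptor).
unit_cards = [(3, "a"), (1, "a"), (7, "a"),
              (1, "b"), (3, "b"), (9, "b"),
              (3, "c"), (7, "c"), (9, "c")]
decks = build_file(unit_cards)

pairs = collate(decks["a"], decks["b"])    # coincidences of "a" and "b"
triples = collate(pairs, decks["c"])       # then collated against "c"
print(pairs, triples)
```

Each pass reads both decks in full, so the work grows with the size of the decks pulled from the file, which is why broad descriptors make collation slow.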
It is one of the hopes of proponents of this method that the
time-consuming collation process can be held down by using very narrow
and precise descriptors. If this could be done, the unit card system
would be an excellent solution to information retrieval.
For my part, I have at least two objections to the unit card
method besides the matter of the great overload on storage. The first
is that in all my experience working with retrieval systems, I have
found that the descriptors must be broad, not precise. This seems
fundamental to the whole retrieval situation, and enters in several
different ways. The second point is that large-scale mechanization
of the unit card system gives rise to difficulties with respect to
collation and the ease of insertion of new items in the storage
system. This is inherently a matter of the use of machines, and it
involves the sequential sorting problem. In particular, readjustment
of the record would become most difficult if the system went beyond
the use of cards into the use of a magnetic or film record. I bring
this up because a unit card system applied to more than one million
items must run into an enormous collection of cards, and in a project
of this size it would be desirable to completely mechanize the process
by the use of a film or tape record.
It can be concluded, though with some serious reservations,
that the unit card system is the first system considered that can
actually meet the requirements set for information retrieval.
VIII - MICROFILM RAPID SELECTOR
This is the machine constructed for the Department of
Agriculture by Engineering Research Associates along the lines of an
earlier though similar machine by V. Bush. It is a very interesting
device. As an electronic machine, it can be criticised for being at
least an order of magnitude too slow in its speed of scanning. It is
slow by a factor of 100 as compared to the internal processes of the
BINAC. Its present low rate of scanning of only 10,000 items per
minute makes its present cost difficult to justify when compared to the
speeds of about 1,000 items per minute that can be attained in a
comparable selection situation when sorting cards by a simple
hand-operated machine. It has been suggested, however, that succeeding
versions of the machine would be cheaper than the $75,000 cost of the
first model.
The full mechanical details of the machine are to be found
in a report [10] by Engineering Research Associates. We will consider
here only those mechanical details which have an impact on the theory
of information retrieval. In contrast to the method of mutually
exclusive subfields already discussed, this machine might be said to
operate on the method of "alternative subfields" [11].
The record medium is a reel of film. Each frame of the 35 mm
film is split in two, with one half carrying a photo reduction of a
typewritten abstract of a document and the citation. The other half
of the frame carries six subfields for digital designation of the
descriptors. Each subfield has 35 binary positions that can either be
blackened or left clear. With five binary positions per decimal digit,
a subfield can represent any seven-digit number, giving ten million
different codes.
The machine serially scans all the frames at a rate of about
180 per second, and a selective photocell arrangement scrutinizes in
series each subfield of every frame as it passes. The machine selects
when the specified seven-digit code pattern is found in any subfield
of a frame. Upon selection, a micro-flash lamp fires and a photographic
copy is made of the moving abstract. The machine selects according to
only a single seven-digit code, and there is no way to use two codes
for selection on a combination.
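The capacity and speed figures above follow from simple arithmetic, and the selection rule reduces to a membership test. In the sketch below the constants come from the text. The sample frames and the helper `selects` are hypothetical. The particular five-position pattern used for each decimal digit is not modelled, so codes are handled as plain integers.

```python
BINARY_POSITIONS = 35      # per subfield (from the text)
POSITIONS_PER_DIGIT = 5    # per decimal digit (from the text)
SUBFIELDS_PER_FRAME = 6    # per frame (from the text)
FRAMES_PER_SECOND = 180    # approximate scanning rate (from the text)

digits = BINARY_POSITIONS // POSITIONS_PER_DIGIT
codes = 10 ** digits
frames_per_minute = FRAMES_PER_SECOND * 60
print(digits, codes, frames_per_minute)

def selects(frame_codes, wanted_code):
    """A frame is selected when the single wanted code appears in any
    one of its subfields; there is no way to ask for two codes at once."""
    return wanted_code in frame_codes

# Hypothetical reel: each frame carries up to six codes in no set order.
film = [
    (12, 4051, 990001),
    (4051, 77),
    (5, 6, 7, 8, 9, 10),
]
assert all(len(frame) <= SUBFIELDS_PER_FRAME for frame in film)
selected = [n for n, frame in enumerate(film) if selects(frame, 4051)]
print(selected)
```

At about 180 frames per second the scanning rate works out to 10,800 frames per minute, in line with the figure of roughly 10,000 items per minute quoted earlier.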
The method for coding the assignment of verbal meanings to
the seven-digit numbers is unusual. In explaining the method by
which he does this, Ralph R. Shaw, the Librarian of the Department of
Agriculture, at the meeting of the American Chemical Society last fall,
pointed out there was apparently no foreseeable unanimity about schemes
for classification. Instead of waiting for problematic future
agreements in this field, he said it was his aim to avoid conflict and to
apply what was already in wide use and acceptance. He bases his coding
system upon those large indexes that are already in operation for the
different scientific fields. For instance, in chemistry, he uses
Chemical Abstracts. Given the index, he takes a numbering machine, and
starting with the number 0000001 at "AAA" he numbers each line and
entry of the entire index. Sub-entries are numbered serially after
the main entries, all in order. An abstract is given up to six of
such codes, one in each subfield, and with no particular order for
the sequence of codes on the frame.

[10] Anon. Report for the Microfilm Rapid Selector (Engineering
     Research Associates, Inc.), No. 97313, U. S. Department of
     Commerce. 1949.
[11] The selector described by J. Samain (pp. 265-266, 680-685,
     The Royal Society Scientific Information Conference. London.
     1948) also makes use of alternative subfields, but with a
     different method of coding.

To perform a machine selection, it is first necessary to
look in the index to find the single word entry and corresponding code
that will give the selection desired. The one code is entered in the
machine. The machine then copies out those abstracts that have this
code in one of their subfields.
In effect, with this method of coding, the mechanism does
only the equivalent of looking up page numbers and making a photographic
copy of the abstract. There is still the severe load upon the human
operator who must select the single index entry to define the selection.
All the difficulties that apply to the index method must
apply here, irrespective of the machine operation in the copying stage.
In terms of our formulation of the problem, the descriptors are not
independent. Each code really represents a complex of descriptors,
and the choice of descriptor combinations is restricted to those that
are already listed in the index. Unusual configurations of descriptors,
possibly not of importance when the original index was constructed, are
impossible to find, due to the coding method chosen. Thus in the
strict sense in which we are using the term, the technique is incapable
of efficient information retrieval.
IX - ATOMIC ENERGY COMMISSION JOINT PROJECT
These difficulties of the Microfilm Rapid Selector have been
realized, and there is now underway a joint project between the
Department of Agriculture and the Atomic Energy Commission for the
revision of the present machine. The direction of these improvements is
not clear at this time, though there may be an effort to set up the
machine and its coding so that statements in the form of the
propositional calculus can be put into the coding and selection [12]. In
this field there is a considerable amount of work that might be adapted
to the electrical or electronic realization of such logical relations,
of which the work of Shannon might be mentioned [13]. Shannon showed
mathematically how a large range of propositional functions in the
logical calculus could be set up by realizable configurations of relays
and contactors.

[12] Mortimer Taube, A.E.C. Personal communication.
[13] C. E. Shannon. A symbolic analysis of relay and switching circuits.
     Trans. Amer. Inst. Electr. Engr. v. 57, pp. 713-723 (1938).

With respect to this tentative line of approach, but without
reference to the AEC work, I might make some additional remarks. By
setting up an array of ideas in a logical structure symbolized by a
polynomial in the propositional calculus, one is in effect imposing a
grammar on the ideas. The two concepts are in a way equivalent. It
has been my experience, from experimenting with modes of description
for machine selection, that unless the atomic ideas exist in a very
well-determined structure, the grammar can cause trouble by imposing
a "point of view". For instance, "I eat a banana" and "The banana
was eaten by me" mean exactly the same thing, though their form is
quite different because of the differing points of view. More complex
situations are even more difficult.
In chemistry the structures are quite determinate (at least
up to a certain point) and then grammar can often be used to advantage,
though even so it can be overdone.
X - ZATOCODING
Zatocoding is a new method of coding that I am very much
interested in, since I have been concerned with its mathematical
formulation. It is inherently a principle of coding rather than any
specific machine embodiment. It can be applied with a number of different
digital machines: electronic scanners, tabulating machinery, and even
with such simple hand-sorted punched cards as the one I am showing in
the Zator Company exhibit at Rutgers here today. Zatocoding can be
briefly characterized as the coding technique which uses the
superimposition of random subject codes in a single coding field.
Zatocoding is a system of coding which was designed primarily
for information retrieval, and it has revealed the need for some drastic
changes in the conventional library postulates or doctrines. For
instance, in Zatocoding the unwanted bulk of the material in the file
is rejected according to statistical rules, rather than by the
principles of the "exactness" implied in ordinary library systems.
Zatocoding is able to combine the best feature of the unit card system
(the independence of attributes) with the best feature of the method
of exclusive subfields (the use of a single card per information item).
Yet, Zatocoding is able to leave behind the most serious disadvantages
of both methods: respectively, the many cards per information item in
the unit card system, and the indeterminacy of subfields in the method
of exclusive subfields.
The Zatocoding method is as follows: To each information
item there is delegated a single card which has a field for carrying
punches. Other carriers of digital information such as film could
be used auite as well. Each information item is characterized by a
set of attributes, which we can consider as having been written out
on the face of the card. There are as many cards as there are inform
ation items in the collection. The set of 411 the attributes used in
the whole collection forms a "vocabulary" of descriptive terms. Codes
are assigned to the attributes in the vocabulary by starting at the
top of the list and giving the first attribute a random pattern of
punches ranging over the field. The second attribute is given a
second pattern also ranging over the field, and generated randomly
and independently of the first. And so on for each attribute in turn.
A card is coded by finding those patterns assigned to the
attributes written on the card, and by punching those patterns into
the single field of the card one on top of the other in
superimposition. Mathematically speaking, the patterns are combined by
Boolean addition in the single coordinate system of the field.
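The coding procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field width of 40 positions, the 4 punches per pattern, and the names `assign_codes` and `code_card` are hypothetical and do not come from the bulletin. A pattern is held as an integer bit mask, so that Boolean addition becomes the bitwise OR.

```python
import random

FIELD_POSITIONS = 40      # assumed width of the single coding field
PUNCHES_PER_PATTERN = 4   # assumed number of punches in each attribute pattern


def assign_codes(vocabulary, rng):
    """Give every attribute its own random pattern, drawn independently."""
    codes = {}
    for attribute in vocabulary:
        positions = rng.sample(range(FIELD_POSITIONS), PUNCHES_PER_PATTERN)
        codes[attribute] = sum(1 << p for p in positions)
    return codes


def code_card(attributes, codes):
    """Superimpose the patterns of a card's attributes by Boolean addition (OR)."""
    field = 0
    for attribute in attributes:
        field |= codes[attribute]
    return field


rng = random.Random(1950)  # fixed seed only so that the sketch is repeatable
codes = assign_codes(["large", "red", "apples", "airdromes", "airfoils"], rng)
card = code_card(["large", "red", "apples"], codes)
print(format(card, f"0{FIELD_POSITIONS}b"))  # the punched field, one digit per position
```

Because the patterns overlap freely in the one field, a card carrying three attributes shows at most twelve punches here, and possibly fewer where two patterns happen to share a position.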
The feature of randomness of the codes in Zatocoding is very
important and merits additional discussion. It is not sufficient for
the cards, as punched out with the several codes, merely to give the
appearance of randomness in the dictionary sense: "without definite
aim, direction, rule, or method". This is not enough. What is
required for successful operation of the statistical selection process
in Zatocoding is (1) that the code patterns, taken individually, have
a mathematically random scatter of punches ranging over the field, and
(2) that the patterns considered with respect to the list of attributes
exhibit a mathematical randomness from one to another.
The reason for such a stringent requirement on randomness
is that in selection the statistical rejection of the unwanted cards
operates by means of the differing patterns of the selected cards as
compared to the rejected cards. The only way to guarantee code
patterns that differ as much as possible among themselves is to produce
them randomly with respect to each other. When this is done, the
required randomness will prevail no matter how the attributes may be
arranged in alphabetical lists, or classified by subject. Thus
"airdromes" and "airfoils" have entirely different Zatocoding patterns,
in spite of their alphabetical or subject contiguity, and in selection
a strong statistical discrimination is exerted between them.
To return to the pack of cards, in Zatocoding each card
is punched with a set of random patterns in superimposition in the
field, with these patterns individually representing the attributes
descriptive of the subject content of the card. A Zatocoding
selection is defined by a single or multiple set of attributes, typically
two or three in combination. To carry out a selection, the Zatocoding
patterns corresponding to the several selecting attributes are
combined also by Boolean addition, to give the total selective pattern S.
Then if C is a typical pattern on a card, Zatocoding selection occurs
when the pattern C includes S, or in the notation of Boolean algebra,
when C'S = 0, with the apostrophe denoting complementation with
respect to the whole field of the card.

14 See Garrett Birkhoff and S. MacLane, A Survey of Modern Algebra,
Macmillan Co., New York, 1944, pp. 311-332.

Those cards are selected which contain each and every one
of the selector attributes. If the patterns of these attributes have
been punched out on a particular card, the inclusion relation must hold
with this card, and it must be selected irrespective of the other
patterns on the card. In this respect, selection is according to the
logical product of the selector attributes; e.g. all cards bearing
punches for "large", "red", and "apples" simultaneously will come out
when these attributes are placed in the selector.
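The inclusion test can be illustrated as follows. The sketch assumes that patterns are stored as integer bit masks over a small 12-position field; the particular patterns, the attribute "green", and the names `pattern` and `selects` are made up for the illustration and are not taken from the bulletin.

```python
FIELD_POSITIONS = 12                    # assumed (small) field width
FIELD_MASK = (1 << FIELD_POSITIONS) - 1


def pattern(*positions):
    """Build a pattern with punches at the given field positions."""
    return sum(1 << p for p in positions)


def selects(card, selector):
    """True when card pattern C includes selector pattern S, i.e. C'S = 0."""
    complement = ~card & FIELD_MASK     # C', taken with respect to the whole field
    return (complement & selector) == 0


# Hypothetical code patterns, chosen by hand so the example is easy to follow.
codes = {
    "large": pattern(0, 5),
    "red": pattern(2, 9),
    "apples": pattern(5, 11),
    "green": pattern(1, 7),
}

card_a = codes["large"] | codes["red"] | codes["apples"]    # large red apples
card_b = codes["large"] | codes["green"] | codes["apples"]  # large green apples

selector = codes["large"] | codes["red"] | codes["apples"]
print(selects(card_a, selector))  # True: every selector punch is on the card
print(selects(card_b, selector))  # False: the punches for "red" are missing
```

A selector built from "large" and "apples" alone would pass both cards, since each of them carries those two patterns whatever else is punched in the field.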
While all cards fitting the selector prescription must be
selected by the inclusion principle, the strict converse does not hold
with respect to the exclusion of the unwanted cards. This peculiarity
of Zatocoding comes from the superimposition of many code patterns
in the single field of the card. There is an intermingling and
overlapping of the individual patterns. Because of this overlapping,
there is a finite statistical possibility that patterns having no
intellectual connection with the desired patterns can combine to
simulate the configuration of punches in the cards having the desired
patterns. Such cards do select out, and are called "extra cards".
However, and this is important, the relative frequency of such extra
cards with respect to the entire collection is under strict
statistical control. Typically, in a selection on two patterns, the
frequency will be .001 or less. Generally, where S is the total number
of positions in the selector pattern, the average ratio of extra cards
is always less than (1/2)^S, and often very much less.
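To give a feeling for the size of this bound, the short sketch below tabulates it, reading the bound as (1/2)^S with S the number of positions in the selector pattern. The function names are hypothetical, and the calculation takes the stated bound at face value rather than deriving it.

```python
def extra_card_bound(selector_positions):
    """Upper bound (1/2)**S on the average ratio of extra cards."""
    return 0.5 ** selector_positions


def positions_needed(target_ratio):
    """Smallest S for which the bound (1/2)**S does not exceed target_ratio."""
    s = 0
    while extra_card_bound(s) > target_ratio:
        s += 1
    return s


for s in (4, 8, 10, 12):
    print(s, extra_card_bound(s))

print(positions_needed(0.001))  # 10, since (1/2)**10 = 0.0009765625
```

On this reading, a selector pattern of ten positions is already enough to hold the bound under the figure of .001 quoted for a selection on two patterns.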
One might say that selection of cards by Zatocoding is
according to the logical product of the selector attributes plus
"epsilon", where epsilon can be made as small as desired by design
of the system. While Zatocoding selection is not exact from a
perfectionist's standpoint, it is a good engineering solution to a
problem, particularly when epsilon can easily be brought to 10 ^ or
less if ever required.
Zatocoding, by accepting the existence of the inconsequential
epsilon, accomplishes these things:
1. There is no indeterminacy of subfields for the location
of an attribute on a card, because all the codes are in
a single coordinate frame.
2. To find an attribute, a selector mechanism need search
only in one location on the card.
3. Attributes are used entirely independently, both in
selection and in making up the card. Such independence,
in conjunction with good statistical control of extras,
is gained through the use of random code assignments.
4. A feature of great practical importance is the enormous
increase in the size of the usable vocabulary, as compared
to coding methods such as the "method of exclusive
subfields".
5. Selection by Zatocoding automatically is made according to
the logical product of the selector attributes, the natural
way for defining a selection.
6. Zatocoding leads to extremely simple structures in the
selective machines, a matter of great importance to this
group of machine designers and builders.
Because all patterns with Zatocoding are carried in the
single coordinate frame of the field, the selector mechanism does not
require a "subfield shifter" so that it can look into several
different subfields on the card for individual patterns. Because of
pattern inclusion selection, a very simple digital pattern
recognition scheme is possible, the simplest being represented by a mask
and a photoelectric arrangement in an optical system.
Simple structure quite generally means high-speed operation.
This is true of Zatocoding. The exhibit I have here at Rutgers shows
(as you perhaps have tried for yourself) that cards can be sorted at
a rate of around 800 per minute with a strictly mechanical device,
without recourse to electronics. This is very favorable when compared
with other systems, such as tabulating machines. On a frequency basis
(binary digit field positions scanned per second) this simple
selector is operating at about 530 digits per second.
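The two rates quoted above can be checked against each other with a little arithmetic. The sketch below uses only the figures in the text; the number of field positions per card that it derives is an inference from those figures, not a value stated in the bulletin.

```python
CARDS_PER_MINUTE = 800      # sorting rate quoted in the text
DIGITS_PER_SECOND = 530     # scanning rate quoted in the text

cards_per_second = CARDS_PER_MINUTE / 60
implied_positions_per_card = DIGITS_PER_SECOND / cards_per_second

print(round(cards_per_second, 1))            # about 13.3 cards per second
print(round(implied_positions_per_card, 2))  # about 39.75 positions per card
```

The figures are thus consistent with a coding field of roughly 40 binary positions on each card.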
Because of the extreme simplicity possible with Zatocoding,
it is possible to envisage an electronic selector method for scanning
a record at a rate of 10 digits per second using essentially our
present technology. This represents a scanning speed of
approximately a million coded fields per second. At this rate a subject
search of all the volumes in the Library of Congress would take less
than ten seconds, and a search of all known scientific papers for
all time would take less than five minutes.
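The search times quoted here imply upper limits on the size of the collections being imagined. The sketch below works them out, assuming one coded field per item and taking the rate of a million coded fields per second at face value; the function name is hypothetical, and the resulting collection sizes are implied bounds, not figures given in the text.

```python
FIELDS_PER_SECOND = 1_000_000   # "approximately a million coded fields per second"


def max_items_searched(seconds):
    """Number of coded fields (items) that can be scanned in the given time."""
    return FIELDS_PER_SECOND * seconds


print(max_items_searched(10))       # within ten seconds: 10,000,000 items
print(max_items_searched(5 * 60))   # within five minutes: 300,000,000 items
```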
XI - EPILOG
The conclusions with respect to information retrieval and
machine economics of various systems are already clear and need not
be dwelt upon here. The systems, as discussed, represent a graded
series of techniques ordered with increasing sophistication of approach.
At the same time, there is a progressive lifting of the intellectual
load on the user of the retrieval system. For instance, a user having
decided upon a subject, need give much less intellectual attention to
sorting a pack of punched cards than he need give to the job of using
a highly cross-referenced card index system. This is all to the good,
for it should be the purpose of machines to remove the load of mere
drudgery from the minds of human beings.

15 For a description, see Zator Technical Bulletin No. *K).

With respect to large-scale systems for the machine retrieval
of information, I believe that these things can be asserted:
Information retrieval, in the useful sense defined here, is now possible
with known systems and mechanisms. Large-scale high-speed machines
are being built, and greatly improved machines will come from further
application of known techniques.
Machine builders and applied mathematicians have now taken
the lead without waiting for librarians to come to grips with or
formulate the solution to the large-scale machine retrieval problem.
They will soon be in the position of telling the librarians and
documentalists what the fundamental operational requirements of
retrieval are, of developing the theories that apply, and they are
well on the way to the production of useful machines to do the job.
Library science has been largely stalled for two millennia
with an organization principle which came from Aristotle, and storage
principles of the Ptolemaic librarians of Alexandria. Now we can
hope that the initially successful departures into new methods
and machines for information retrieval will continue and expand.
We can also hope that effort will at last be guided by the principles
of engineering and the scientific method instead of the outworn
metaphysics which has too long held sway through actual default on
the part of the scientists.