
Information Retrieval

Phyllis Baxendale


A Relational Model of Data for


Large Shared Data Banks


E.F. Codd

Research Laboratory, San Jose, California

June, 1970
Volume 13, Number 6
pp. 377-387

Future users of large data banks must be protected from
having to know how the data is organized in the machine (the

In 1970, Codd proposed a new model for database systems called the relational
model. Through its simplicity and mathematical basis, the relational model has
provided an intuitively more appealing foundation for database systems than its
two major competitors: the hierarchical and network models. The model has had an
enormous impact on both the theory and development of database systems. A growing
number of commercial database systems are relational. In 1981, the ACM Turing
Award was presented to Codd.

unaffected when the internal representation of data is changed
and even when some aspects of the external representation
traffic and natural growth in the types of stored information.
with tree-structured files or slightly more general network
are discussed. A model based on n-ary relations, a normal

KEY WORDS AND
CR CATEGORIES: 3.70, 3.73, 3.75, 4.20, 4.22, 4.29

1. Relational Model and Normal Form


This paper is concerned with the application of elementary relation theory to
systems which provide shared access to large banks of formatted data. Except for
a paper by Childs [1], the principal application of relations to data systems has
been to deductive question-answering systems. Levein and Maron [2] provide
numerous references to work in this area.
In contrast, the problems treated here are those of data independence (the
independence of application programs and terminal activities from growth in data
types and changes in data representation) and certain kinds of data inconsistency
which are expected to become troublesome even in nondeductive systems.
25th Anniversary Issue

A further advantage of the relational view is that it forms a sound basis for
treating derivability, redundancy, and consistency of relations; these are
discussed in Section 2. The network model, on the other hand, has spawned a
number of confusions, not the least of which is mistaking the derivation of
connections for the derivation of relations (see remarks in Section 2 on the
"connection trap").
Finally, the relational view permits a clearer evaluation of the scope and
logical limitations of present formatted data systems, and also the relative
merits (from a logical standpoint) of competing representations of data within a
single system. Examples of this clearer perspective are cited in various parts of
this paper. Implementations of systems to support the relational model are not
discussed.

and applied to the problems of redundancy and consistency

-D.E.D.

Communications
of
the ACM

The relational view (or model) of data described in Section 1 appears to be
superior in several respects to the graph or network model [3, 4] presently in
vogue for noninferential systems. It provides a means of describing data with its
natural structure only, that is, without superimposing any additional structure
for machine representation purposes. Accordingly, it provides a basis for a high
level data language which will yield maximal independence between programs on the
one hand and machine representation and organization of data on the other.

January, 1983
Volume 26
Number 1

1.2. DATA DEPENDENCIES IN PRESENT SYSTEMS

The provision of data description tables in recently developed information
systems represents a major advance toward the goal of data independence
[5, 6, 7]. Such tables facilitate changing certain characteristics of the data
representation stored in a data bank. However, the variety of data representation
characteristics which can be changed without logically impairing some application
programs is still quite limited. Further, the model of data with which users
interact is still cluttered with representational properties, particularly in
regard to the representation of collections of data (as opposed to individual
items). Three of the principal kinds of data dependencies which still need to be
removed are: ordering dependence, indexing dependence, and access path
dependence. In some systems these dependencies are not clearly separable from one
another.
1.2.1. Ordering Dependence. Elements of data in a data bank may be stored in a
variety of ways, some involving no concern for ordering, some permitting each
element to participate in one ordering only, others permitting each element to
participate in several orderings. Let us consider those existing systems which
either require or permit data elements to be stored in at least one total
ordering which is closely associated with the hardware-determined ordering of
addresses. For example, the records of a file concerning parts might be stored in
ascending order by part serial number. Such systems normally permit application
programs to assume that the order of presentation of records from such a file is
identical to (or is a subordering of) the stored ordering. Those application
programs which take advantage of the stored ordering of a file are likely to fail
to operate correctly if for some reason it becomes necessary to replace that
ordering by a different one. Similar remarks hold for a stored ordering
implemented by means of pointers.
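The ordering dependence just described can be illustrated with a minimal Python sketch. The part records and both function names are hypothetical, introduced only for illustration. The first lookup quietly assumes that records are presented in ascending order by part serial number; the second makes no such assumption.

```python
from bisect import bisect_left

# Hypothetical part file: (part serial number, part name), stored in ascending serial order.
parts_sorted = [(101, "bolt"), (205, "gear"), (317, "nut"), (440, "shaft")]

def find_part_assuming_order(records, serial):
    """Binary search; correct only if records arrive in ascending serial order."""
    keys = [record[0] for record in records]
    i = bisect_left(keys, serial)
    if i < len(records) and records[i][0] == serial:
        return records[i]
    return None

def find_part_order_free(records, serial):
    """Linear scan; independent of the stored ordering."""
    for record in records:
        if record[0] == serial:
            return record
    return None

# The same data after the stored ordering is replaced by a different one
# (here, descending order by part name).
parts_reordered = sorted(parts_sorted, key=lambda record: record[1], reverse=True)

print(find_part_assuming_order(parts_sorted, 205))     # finds the record
print(find_part_assuming_order(parts_reordered, 205))  # silently misses it
print(find_part_order_free(parts_reordered, 205))      # still finds the record
```

The order-dependent program does not crash; it simply returns a wrong answer once the stored ordering changes, which is the kind of logical impairment the text has in mind.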

Structure 1. Projects Subordinate to Parts

Now, consider the problem of printing out the part number, part name, and
quantity committed for every part used in the project whose project name is
"alpha." The following observations may be made regardless of which available
tree-oriented information system is selected to tackle this problem. If a program
P is developed for this problem assuming one of the five structures above (that
is, P makes no test to determine which structure is in effect), then P will fail
on at least three of the remaining structures. More specifically, if P succeeds
with structure 5, it will fail with all the others; if P succeeds with structure
3 or 4, it will fail with at least 1, 2, and 5; if P succeeds with 1 or 2, it
will fail with at least 3, 4, and 5. The reason is simple in each case. In the
absence of a test to determine which structure is in effect, P fails because an
attempt is made to execute a reference to a nonexistent file (available systems
treat this as an error) or no attempt is made to execute a reference to a file
containing needed information. The reader who is not convinced should develop
sample programs for this simple problem.
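In the spirit of the exercise suggested above, here is a minimal Python sketch of one such sample program. Nested dictionaries stand in for tree-structured files; the sample data, the field spellings, and the names `program_p` and `run` are all assumptions made for illustration. `program_p` is written against Structure 2 (parts subordinate to projects) and performs no test of which structure is in effect.

```python
# Structure 1: projects subordinate to parts (hypothetical sample data).
structure_1 = {
    "PART": [
        {"part#": 1, "part name": "bolt",
         "PROJECT": [{"project#": 10, "project name": "alpha", "quantity committed": 5}]},
        {"part#": 2, "part name": "gear",
         "PROJECT": [{"project#": 11, "project name": "beta", "quantity committed": 7}]},
    ]
}

# Structure 2: parts subordinate to projects, holding the same information.
structure_2 = {
    "PROJECT": [
        {"project#": 10, "project name": "alpha",
         "PART": [{"part#": 1, "part name": "bolt", "quantity committed": 5}]},
        {"project#": 11, "project name": "beta",
         "PART": [{"part#": 2, "part name": "gear", "quantity committed": 7}]},
    ]
}

def program_p(bank):
    """Enters the tree through the PROJECT file; the access path is fixed in the program."""
    rows = []
    for project in bank["PROJECT"]:
        if project["project name"] == "alpha":
            for part in project["PART"]:
                rows.append((part["part#"], part["part name"], part["quantity committed"]))
    return rows

def run(program, bank):
    """Report a reference to a nonexistent file as a failure instead of raising."""
    try:
        return program(bank)
    except KeyError as missing:
        return f"failed: no file {missing}"

print(run(program_p, structure_2))  # [(1, 'bolt', 5)]
print(run(program_p, structure_1))  # fails: the top-level PROJECT file does not exist
```

Both structures carry identical information, yet the program succeeds on one and fails on the other purely because of the access path it assumes.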

because all the well-known information systems that are marketed today fail to
make a clear distinction between order of presentation on the one hand and stored
ordering on the other. Significant implementation problems must be solved to
provide this kind of independence.
Structure 2. Parts Subordinate to Projects

1.2.2. Indexing Dependence. In the context of formatted data, an index is usually
thought of as a purely performance-oriented component of the data representation.
It tends to improve response to queries and updates and, at the same time, slow
down response to insertions and deletions. From an informational standpoint, an
index is a redundant component of the data representation. If a system uses
indices at all and if it is to perform well in an environment with changing
patterns of activity on the data bank, an ability to create and destroy indices
from time to time will probably be necessary. The question then arises: Can
application programs and terminal activities remain invariant as indices come and
go?
Present formatted data systems take widely different approaches to indexing.
TDMS [7] unconditionally provides indexing on all attributes. The presently
released version of IMS [5] provides the user with a choice for each file: a
choice between no indexing at all (the hierarchic sequential organization) or
indexing on the primary key only (the hierarchic indexed sequential
organization). In neither case is the user's application logic dependent on the
existence of the unconditionally provided indices. IDS [8], however, permits the
file designers to select attributes to be indexed and to incorporate indices into
the file structure by means of additional chains. Application programs taking
advantage of the performance benefit of these indexing chains must refer to those
chains by name. Such programs do not operate correctly if these chains are later
removed.
1.2.3. Access Path Dependence. Many of the existing formatted data systems
provide users with tree-structured files or slightly more general network models
of the data. Application programs developed to work with these systems tend to be
logically impaired if the trees or networks are changed in structure. A simple
example follows.
Suppose the data bank contains information about parts and projects. For each
part, the part number, part name, part description, quantity-on-hand, and
quantity-on-order are recorded. For each project, the project number, project
name, project description are recorded. Whenever a project makes use of a certain
part, the quantity of that part committed to the given project is also recorded.
Suppose that the system requires the user or file designer to declare or define
the data in terms of tree structures. Then, any one of the hierarchical
structures may be adopted for the information mentioned above (see Structures
1-5).


Since, in general, it is not practical to develop application programs which test
for all tree structurings permitted by the system, these programs fail when a
change in structure becomes necessary.
Systems which provide users with a network model of the data run into similar
difficulties. In both the tree and network cases, the user (or his program) is
required to exploit a collection of user access paths to the data. It does not
matter whether these paths are in close correspondence with pointer-defined paths
in the stored representation; in IDS the correspondence is extremely simple, in
TDMS it is just the opposite. The consequence, regardless of the stored
representation, is that terminal activities and programs become dependent on the
continued existence of the user access paths.

Structure 3. Parts and Projects as Peers; Commitment Relationship Subordinate to Projects

Structure 4. Parts and Projects as Peers; Commitment Relationship Subordinate to Parts

One solution to this is to adopt the policy that once a user access path is
defined it will not be made obsolete until all application programs using that
path have become obsolete. Such a policy is not practical, because the number of
access paths in the total model for the community of users of a data bank would
eventually become excessively large.
Structure 5. Parts, Projects, and Commitment Relationship as Peers

1.3. A RELATIONAL VIEW OF DATA

The term relation is used here in its accepted mathematical sense. Given sets
$S_1, S_2, \ldots, S_n$ (not necessarily distinct), $R$ is a relation on these
$n$ sets if it is a set of n-tuples each of which has its first element from
$S_1$, its second element from $S_2$, and so on. We shall refer to $S_j$ as the
jth domain of $R$. As defined above, $R$ is said to have degree $n$. Relations of
degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree
$n$ n-ary.

More concisely, $R$ is a subset of the Cartesian product
$S_1 \times S_2 \times \cdots \times S_n$.

For expository reasons, we shall frequently make use of an array representation
of relations, but it must be remembered that this particular representation is
not an essential part of the relational view being expounded. An array which
represents an n-ary relation $R$ has the following properties:
(1) Each row represents an n-tuple of $R$.
(2) The ordering of rows is immaterial.
(3) All rows are distinct.
(4) The ordering of columns is significant; it corresponds to the ordering
$S_1, S_2, \ldots, S_n$ of the domains on which $R$ is defined (see, however,
remarks below on domain-ordered and domain-unordered relations).
(5) The significance of each column is partially conveyed by labeling it with the
name of the corresponding domain.

The example in Figure 1 illustrates a relation of degree 4, called supply, which
reflects the shipments-in-progress of parts from specified suppliers to specified
projects in specified quantities.

supply (supplier, part, project, quantity)

One might ask: If the columns are labeled by the name of corresponding domains,
why should the ordering of columns matter? As the example in Figure 2 shows, two
columns may have identical headings (indicating identical domains) but possess
distinct meanings with respect to the relation. The relation depicted is called
component. It is a ternary relation, whose first two domains are called part and
third domain is called quantity. The meaning of component (x, y, z) is that part
x is an immediate component (or subassembly) of part y, and z units of part x are
needed to assemble one unit of part y. It is a relation which plays a critical
role in the parts explosion problem.

FIG. 2. A relation with two identical domains
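The definition of a relation and the array properties listed above map naturally onto a set of tuples. The following Python sketch is only an illustration: the domains and the sample 4-tuples are assumed values, and `is_relation_on` and `degree` are hypothetical helper names.

```python
# Hypothetical domains S1..S4 for a relation of degree 4 (sample values only).
suppliers = {1, 2, 4}
parts = {1, 2, 3, 7}
projects = {1, 5, 7}
quantities = range(0, 100)
domains = [suppliers, parts, projects, quantities]

# A set of tuples: row order is immaterial and duplicate rows cannot occur.
supply = {(1, 2, 5, 17), (1, 3, 5, 23), (2, 3, 7, 9), (2, 7, 5, 4), (4, 1, 1, 12)}

def is_relation_on(rel, domains):
    """True if each tuple takes its jth element from the jth domain."""
    return all(
        len(row) == len(domains) and all(x in d for x, d in zip(row, domains))
        for row in rel
    )

def degree(rel):
    """Number of domains, read off any one tuple of a non-empty relation."""
    return len(next(iter(rel)))

print(is_relation_on(supply, domains))            # True
print(degree(supply))                             # 4
print(len(supply | {(1, 2, 5, 17)}))              # 5: re-adding a row changes nothing
print(is_relation_on({(17, 5, 2, 1)}, domains))   # False: column order is significant
```

Using a Python `set` gives properties (2) and (3) for free, while the positional check in `is_relation_on` reflects property (4).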

It is a remarkable fact that several existing information systems (chiefly those
based on tree-structured files) fail to provide data representations for
relations which have two or more identical domains. The present version of
IMS/360 [5] is an example of such a system.
The totality of data in a data bank may be viewed as a collection of time-varying
relations. These relations are of assorted degrees. As time progresses, each
n-ary relation may be subject to insertion of additional n-tuples, deletion of
existing ones, and alteration of components of any of its existing n-tuples.

In many commercial, governmental, and scientific data banks, however, some of the
relations are of quite high degree (a degree of 30 is not at all uncommon). Users
should not normally be burdened with remembering the domain ordering of any
relation (for example, the ordering supplier, then part, then project, then
quantity in the relation supply). Accordingly, we propose that users deal, not
with relations which are domain-ordered, but with relationships which are their
domain-unordered counterparts. To accomplish this, domains must be uniquely
identifiable at least within any given relation, without using position. Thus,
where there are two or more identical domains, we require in each case that the
domain name be qualified by a distinctive role name, which serves to identify the
role played by that domain in the given relation. For example, in the relation
component of Figure 2, the first domain part might be qualified by the role name
sub, and the second by super, so that users could deal with the relationship
component and its domains (sub.part, super.part, quantity) without regard to any
ordering between these domains.
To sum up, it is proposed that most users should interact with a relational model
of the data consisting of a collection of time-varying relationships (rather than
relations). Each user need not know more about any relationship than its name
together with the names of its domains (role qualified whenever necessary). Even
this information might be offered in menu style by the system (subject to
security and privacy constraints) upon request by the user.
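A minimal Python sketch of the distinction between a domain-ordered relation and its domain-unordered counterpart follows. The sample rows and the helper name `relationship` are assumptions for illustration; each tuple is turned into a set of (role-qualified domain name, value) pairs so that position no longer matters.

```python
def relationship(column_names, rows):
    """Turn a domain-ordered relation into its domain-unordered counterpart."""
    return {frozenset(zip(column_names, row)) for row in rows}

# Hypothetical rows of the relation component (sub part, super part, quantity).
component = {(1, 5, 9), (2, 5, 7), (3, 5, 2)}

# A permutation of the same relation: columns stored as (quantity, super, sub).
permuted = {(quantity, sup, sub) for (sub, sup, quantity) in component}

a = relationship(("sub.part", "super.part", "quantity"), component)
b = relationship(("quantity", "super.part", "sub.part"), permuted)

print(component == permuted)  # False: as relations they differ
print(a == b)                 # True: as relationships they are the same
```

The role names sub and super do the work that column position did before, which is why two identical domains (both part) remain distinguishable.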
There are usually many alternative ways in which a relational model may be
established for a data bank. In order to discuss a preferred way (or normal
form), we must first introduce a few additional concepts (active domain, primary
key, foreign key, nonsimple domain) and establish some links with terminology
currently in use in information systems programming. In the remainder of this
paper, we shall not bother to distinguish between relations and relationships
except where it appears advantageous to be explicit.
Consider an example of a data bank which includes relations concerning parts,
projects, and suppliers. One relation called part is defined on the following
domains:
(1) part number
(2) part name
(3) part color
(4) part weight
(5) quantity on hand
(6) quantity on order
and possibly other domains as well. Each of these domains is, in effect, a pool
of values, some or all of which may be represented in the data bank at any
instant. While it is conceivable that, at some instant, all part colors are
present, it is unlikely that all possible part weights, part names, and part
numbers are. We shall call the set of values represented at some instant the
active domain at that instant.

and nonsimple domain, respectively. Much of the confusion in present terminology
is due to failure to distinguish between type and instance (as in "record") and
between components of a user model of the data on the one hand and their machine
representation counterparts on the other hand (again, we cite "record" as an
example).

Normally, one domain (or combination of domains) of a given relation has values
which uniquely identify each element (n-tuple) of that relation. Such a domain
(or combination) is called a primary key. In the example above, part number would
be a primary key, while part color would not be. A primary key is nonredundant if
it is either a simple domain (not a combination) or a combination such that none
of the participating simple domains is superfluous in uniquely identifying each
element. A relation may possess more than one nonredundant primary key. This
would be the case in the example if different parts were always given distinct
names. Whenever a relation has two or more nonredundant primary keys, one of them
is arbitrarily selected and called the primary key of that relation.
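The notions of primary key and nonredundancy can be checked mechanically against a given set of rows. The Python sketch below uses hypothetical part rows and helper names (`is_key`, `is_nonredundant_key`). Note the assumption: it tests only the rows currently present, whereas a primary key in the paper's sense is a property the designer declares for all time-varying states of the relation.

```python
from itertools import combinations

# Hypothetical rows of the relation part (only three of its domains shown).
part = [
    {"part number": 1, "part name": "bolt", "part color": "red"},
    {"part number": 2, "part name": "nut", "part color": "red"},
    {"part number": 3, "part name": "gear", "part color": "blue"},
]

def is_key(rows, columns):
    """True if the values in `columns` identify each current row uniquely."""
    seen = {tuple(row[c] for c in columns) for row in rows}
    return len(seen) == len(rows)

def is_nonredundant_key(rows, columns):
    """A key is nonredundant if no proper subset of its columns is also a key."""
    if not is_key(rows, columns):
        return False
    return not any(
        is_key(rows, subset)
        for size in range(1, len(columns))
        for subset in combinations(columns, size)
    )

print(is_key(part, ["part number"]))                             # True
print(is_key(part, ["part color"]))                              # False
print(is_nonredundant_key(part, ["part number", "part color"]))  # False
print(is_nonredundant_key(part, ["part name"]))                  # True
```

With distinct part names, both part number and part name qualify as nonredundant keys, matching the case in which one of them must be selected arbitrarily.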

1.4. NORMAL FORM

A relation whose domains are all simple can be represented in storage by a
two-dimensional column-homogeneous array of the kind discussed above. Some more
complicated data structure is necessary for a relation with one or more nonsimple
domains. For this reason (and others to be cited below) the possibility of
eliminating nonsimple domains appears worth investigating. There is, in fact, a
very simple elimination procedure, which we shall call normalization.
Consider, for example, the collection of relations exhibited in Figure 3(a). Job
history and children are nonsimple domains of the relation employee. Salary
history is a nonsimple domain of the relation job history. The tree in Figure
3(a) shows just these interrelationships of the nonsimple domains.

If normalization as described above is to be applicable, the unnormalized
collection of relations must satisfy the following conditions:
(1) The graph of interrelationships of the nonsimple domains is a collection of
trees.
(2) No primary key has a component domain which is nonsimple.
The writer knows of no application which would require any relaxation of these
conditions. Further operations of a normalizing kind are possible. These are not
discussed in this paper.
The simplicity of the array representation which becomes feasible when all
relations are cast in normal form is not only an advantage for storage purposes
but also for communication of bulk data between systems which use widely
different representations of the data. The communication form would be a suitably
compressed version of the array representation and would have the following
advantages:
(1) It would be devoid of pointers (address-valued or displacement-valued).
(2) It would avoid all dependence on hash addressing schemes.
(3) It would contain no indices or ordering lists.
If the user's relational model is set up in normal form, names of items of data
in the data bank can take a simpler form than would otherwise be the case. A
general name would take a form such as

R (g).r.d

A common requirement is for elements of a relation to cross-reference other
elements of the same relation or elements of a different relation. Keys provide a
user-oriented means (but not the only means) of expressing such cross-references.
We shall call a domain (or domain combination) of relation R a foreign key if it
is not the primary key of R but its elements are values of the primary key of
some relation S (the possibility that S and R are identical is not excluded). In
the relation supply of Figure 1, the combination of supplier, part, project is
the primary key, while each of these three domains taken separately is a foreign
key.
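The foreign key definition can be sketched as a check over the rows currently in the data bank. Everything named here (`is_foreign_key`, the sample rows, the column spellings) is a hypothetical illustration, and the check covers only the present rows rather than all future states of the relations.

```python
# Hypothetical rows; supply has primary key (supplier, part, project).
part = [{"part number": 1}, {"part number": 2}, {"part number": 3}]
supply = [
    {"supplier": 1, "part": 2, "project": 5, "quantity": 17},
    {"supplier": 1, "part": 3, "project": 5, "quantity": 23},
]

def is_foreign_key(r_rows, columns, r_primary_key, s_rows, s_primary_key):
    """True if `columns` is not R's primary key and each of its current values
    appears as a primary key value in S."""
    if set(columns) == set(r_primary_key):
        return False
    s_keys = {tuple(row[c] for c in s_primary_key) for row in s_rows}
    return all(tuple(row[c] for c in columns) in s_keys for row in r_rows)

supply_key = ["supplier", "part", "project"]
print(is_foreign_key(supply, ["part"], supply_key, part, ["part number"]))      # True
print(is_foreign_key(supply, ["quantity"], supply_key, part, ["part number"]))  # False
```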
In previous work there has been a strong tendency to treat the data in a data
bank as consisting of two parts, one part consisting of entity descriptions (for
example, descriptions of suppliers) and the other part consisting of relations
between the various entities or types of entities (for example, the supply
relation). This distinction is difficult to maintain when one may have foreign
keys in any relation whatsoever. In the user's relational model there appears to
be no advantage to making such a distinction (there may be some advantage,
however, when one applies relational concepts to machine representations of the
user's set of relationships).

employee (man#, name, birthdate, jobhistory, children)
jobhistory (jobdate, title, salaryhistory)
salaryhistory (salarydate, salary)
children (childname, birthyear)

FIG. 3(a). Unnormalized set

employee' (man#, name, birthdate)
jobhistory' (man#, jobdate, title)
salaryhistory' (man#, jobdate, salarydate, salary)
children' (man#, childname, birthyear)

FIG. 3(b). Normalized set

So far, we have discussed examples of relations which are defined on simple
domains, that is, domains whose elements are atomic (nondecomposable) values.
Nonatomic values can be discussed within the relational framework. Thus, some
domains may have relations as elements. These relations may, in turn, be defined
on nonsimple domains, and so on. For example, one of the domains on which the
relation employee is defined might be salary history. An element of the salary
history domain is a binary relation defined on the domain date and the domain
salary. The salary history domain is the set of all such binary relations. At any
instant of time there are as many instances of the salary history relation in the
data bank as there are employees. In contrast, there is only one instance of the
employee relation.

Normalization proceeds as follows. Starting with the relation at the top of the
tree, take its primary key and expand each of the immediately subordinate
relations by inserting this primary key domain or domain combination. The primary
key of each expanded relation consists of the primary key before expansion
augmented by the primary key copied down from the parent relation. Now, strike
out from the parent relation all nonsimple domains, remove the top node of the
tree, and repeat the same sequence of operations on each remaining subtree.
The result of normalizing the collection of relations in Figure 3(a) is the
collection in Figure 3(b). The primary key of each relation is italicized to show
how such keys are expanded by the normalization.
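The normalization procedure described above can be sketched recursively in Python. This is an illustration under stated assumptions, not the paper's notation: nested relations are modeled as lists of dictionaries, a list-valued field stands for a nonsimple domain, the function name `normalize`, the `keys` table, and the sample employee are all hypothetical.

```python
def normalize(name, rows, keys, parent_key=None, out=None):
    """Flatten a tree of nested relations into relations on simple domains.

    `keys` maps each relation name to its own primary key fields before expansion.
    """
    parent_key = parent_key or {}
    out = out if out is not None else {}
    for row in rows:
        # Strike out the nonsimple domains and copy the parent's key down.
        simple = {k: v for k, v in row.items() if not isinstance(v, list)}
        out.setdefault(name, []).append({**parent_key, **simple})
        # Expanded key: parent's key augmented by this relation's own key.
        expanded_key = {**parent_key, **{k: row[k] for k in keys[name]}}
        for field, value in row.items():
            if isinstance(value, list):  # a nonsimple domain: recurse into the subtree
                normalize(field, value, keys, expanded_key, out)
    return out

# Hypothetical unnormalized data shaped like Figure 3(a).
employee = [{
    "man#": 1, "name": "Jones", "birthdate": "1930-01-01",
    "jobhistory": [{
        "jobdate": "1960", "title": "clerk",
        "salaryhistory": [{"salarydate": "1960", "salary": 400},
                          {"salarydate": "1961", "salary": 450}],
    }],
    "children": [{"childname": "Ann", "birthyear": 1955}],
}]
keys = {"employee": ["man#"], "jobhistory": ["jobdate"],
        "salaryhistory": ["salarydate"], "children": ["childname"]}

normalized = normalize("employee", employee, keys)
for relation, rows in normalized.items():
    print(relation, rows)
```

Each resulting relation carries the key copied down from its ancestors, so salaryhistory ends up with man#, jobdate, and salarydate alongside salary, in line with Figure 3(b).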

The terms attribute and repeating group in present data base terminology are
roughly analogous to simple domain

M. E. Sanko of IBM, San Jose, independently recognized the desirability of
eliminating nonsimple domains.

where R is a relational name; g is a generation identifier (optional); r is a
role name (optional); d is a domain name. Since g is needed only when several
generations of a given relation exist, or are anticipated to exist, and r is
needed only when the relation R has two or more domains named d, the simple form
R.d will often be adequate.
1.5. SOME LINGUISTIC ASPECTS

The adoption of a relational model of data, as described above, permits the
development of a universal data sublanguage based on an applied predicate
calculus. A first-order predicate calculus suffices if the collection of
relations is in normal form. Such a language would provide a yardstick of
linguistic power for all other proposed data languages, and would itself be a
strong candidate for embedding (with appropriate syntactic modification) in a
variety of host languages (programming, command- or problem-oriented). While it
is not the purpose of this paper to describe such a language in detail, its
salient features would be as follows.
Let us denote the data sublanguage by R and the host language by H. R permits the
declaration of relations and their domains. Each declaration of a relation
identifies the primary key for that relation. Declared relations are added to the
system catalog for use by any members of the user community who have appropriate
authorization. H permits supporting declarations which indicate, perhaps less
permanently, how these relations are represented in storage. R permits the
specification for retrieval of any subset of data from the data bank. Action on
such a retrieval request is subject to security constraints.
The universality of the data sublanguage lies in its descriptive ability (not its
computing ability). In a large data bank each subset of the data has a very large
number of possible (and sensible) descriptions, even when we assume (as we do)
that there is only a finite set of function subroutines to which the system has
access for use in qualifying data for retrieval. Thus, the class of qualification
expressions which can be used in a set specification must have the descriptive
power of the class of well-formed formulas of an applied predicate calculus. It
is well known that to preserve this descriptive power it is unnecessary to
express (in whatever syntax is chosen) every formula of the selected predicate
calculus. For example, just those in


ternary relation which preserves all of the information in the given relations?
The example in Figure 5 shows two relations R, S, which are joinable without loss
of information, while Figure 6 shows a join of R with S. A binary relation R is
joinable with a binary relation S if there exists a ternary relation U such that
$\pi_{12}(U) = R$ and $\pi_{23}(U) = S$. Any such ternary relation is called a
join of R with S. If R, S are binary relations such that $\pi_2(R) = \pi_1(S)$,
then R is joinable with S. One join that always exists in such a case is the
natural join of R with S defined by
$R * S = \{(a, b, c) : R(a, b) \wedge S(b, c)\}$
where R(a, b) has the value true if (a, b) is a member of R and similarly for
S(b, c). It is immediate that
$\pi_{12}(R * S) = R$ and $\pi_{23}(R * S) = S$.
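The natural join and the projection operator it relies on can be sketched with Python sets. The function names `project` and `natural_join` and the sample relations are assumptions for illustration; column indices are 1-based to match the subscripts in the text.

```python
def project(rel, columns):
    """pi_L(R): keep the listed columns (1-based, in the given order), dropping duplicate rows."""
    return {tuple(row[i - 1] for i in columns) for row in rel}

def natural_join(r, s):
    """R*S = {(a, b, c) : R(a, b) and S(b, c)} for binary relations R and S."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

# Hypothetical sample relations: R(supplier, part) and S(part, project).
R = {(1, 1), (2, 1), (2, 2)}
S = {(1, 1), (1, 2), (2, 1)}

# Joinability condition: the second column of R matches the first column of S.
print(project(R, [2]) == project(S, [1]))  # True

joined = natural_join(R, S)
print(sorted(joined))

# Projecting the join back recovers both relations.
print(project(joined, [1, 2]) == R)  # True
print(project(joined, [2, 3]) == S)  # True
```

Because the result of `project` is a set, duplicate rows disappear automatically, which is why the projections of the five-row join collapse back to the three rows of R and of S.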

4-ary relation supply of Figure 1, which entails 5 names in n-ary notation, would
be represented in the form

2. Redundancy and Consistency

P (supplier, Q (part, R (project, quantity)))

in nested binary notation and, thus, employ 7 names.
A further disadvantage of this kind of expression is its asymmetry. Although this
asymmetry does not prohibit symmetric exploitation, it certainly makes some bases
of interrogation very awkward for the user to express (consider, for example, a
query for those parts and quantities related to certain given projects via Q and
R).

2.1. OPERATIONS ON RELATIONS

Since relations are sets, all of the usual set operations are applicable to them.
Nevertheless, the result may not be a relation; for example, the union of a
binary relation and a ternary relation is not a relation.
The operations discussed below are specifically for relations. These operations
are introduced because of their key role in deriving relations from other
relations. Their principal application is in noninferential information systems
(systems which do not provide logical inference services), although their
applicability is not necessarily destroyed when such services are added.
Most users would not be directly concerned with these operations. Information
systems designers and people concerned with data bank control should, however, be
thoroughly familiar with them.
2.1.1. Permutation. A binary relation has an array representation with two
columns. Interchanging these columns yields the converse relation. More
generally, if a permutation is applied to the columns of an n-ary relation, the
resulting relation is said to be a permutation of the given relation. There are,
for example, 4! = 24 permutations of the relation supply in Figure 1, if we
include the identity permutation which leaves the ordering of columns unchanged.
Since the user's relational model consists of a collection of relationships
(domain-unordered relations), permutation is not relevant to such a model
considered in isolation. It is, however, relevant to the consideration of stored
representations of the model. In a system which provides symmetric exploitation
of relations, the set of queries answerable by a stored relation is identical to
the set answerable by any permutation of that relation. Although it is logically
unnecessary to store both a relation and some permutation of it, performance
considerations could make it advisable.
2.1.2. Projection. Suppose now we select certain columns of a relation (striking
out the others) and then remove from the resulting array any duplication in the
rows. The final array represents a relation which is said to be a projection of
the given relation.
A selection operator $\pi$ is used to obtain any desired permutation, projection,
or combination of the two operations. Thus, if L is a list of k indices
$L = i_1, i_2, \ldots, i_k$ and R is an n-ary relation ($n \geq k$), then
$\pi_L(R)$ is the k-ary relation whose jth column is column $i_j$ of R
($j = 1, 2, \ldots, k$) except that duplication in resulting rows is removed.
Consider the relation supply of Figure 1. A permuted projection of this relation
is exhibited in Figure 4. Note that, in this particular case, the projection has
fewer n-tuples than the relation from which it is derived.

FIG. 4. A permuted projection of the relation in Figure 1

2.1.3. Join. Suppose we are given two binary relations, which have some domain in
common. Under what circumstances can we combine these relations to form a

Note that the join shown in Figure 6 is the natural join of R with S from Figure
5. Another join is shown in Figure 7.

FIG. 7. Another join of R with S (from Figure 5)

Inspection of these relations reveals an element (element 1) of the domain part
(the domain on which the join is to be made) with the property that it possesses
more than one relative under R and also under S. It is this ele-

Arithmetic functions may be needed in the qualification or other parts of retrieval statements. Such functions can be defined in H and invoked in R.

A set so specified may be fetched for query purposes only, or it may be held for possible changes. Insertions take the form of adding new elements to declared relations without regard to any ordering that may be present in their machine representation. Deletions which are effective for the community (as opposed to the individual user or subcommunities) take the form of removing elements from declared relations. Some deletions and updates may be triggered by others, if deletion and update dependencies between specified relations are declared in R.

One important effect that the view adopted toward data has on the language used to retrieve it is in the naming of data elements and sets. Some aspects of this have been discussed in the previous section. With the usual network view, users will often be burdened with coining and using more relation names than are absolutely necessary, since names are associated with paths (or path types) rather than with relations.

Once a user is aware that a certain relation is stored, he will expect to be able to exploit it using any combination of its arguments as "knowns" and the remaining arguments as "unknowns," because the information (like Everest) is there. This is a system feature (missing from many current information systems) which we shall call (logically) symmetric exploitation of relations. Naturally, symmetry in performance is not to be expected.

To support symmetric exploitation of a single binary relation, two directed paths are needed. For a relation of degree n, the number of paths to be named and controlled is n factorial.

Again, if a relational view is adopted in which every n-ary relation (n > 2) has to be expressed by the user as a nested expression involving only binary relations (see Feldman's LEAP System [10], for example) then 2n - 1 names have to be coined instead of only n + 1 with direct n-ary notation as described in Section 1.2. For example, the

1.6. EXPRESSIBLE, NAMED, AND STORED RELATIONS

Associated with a data bank are two collections of relations: the named set and the expressible set. The named set is the collection of all those relations that the community of users can identify by means of a simple name (or identifier). A relation R acquires membership in the named set when a suitably authorized user declares R; it loses membership when a suitably authorized user cancels the declaration of R.

The expressible set is the total collection of relations that can be designated by expressions in the data language. Such expressions are constructed from simple names of relations in the named set; names of generations, roles and domains; logical connectives; the quantifiers of the predicate calculus; and certain constant relation symbols such as =, >. The named set is a subset of the expressible set, usually a very small subset.

Since some relations in the named set may be time-independent combinations of others in that set, it is useful to consider associating with the named set a collection of statements that define these time-independent constraints. We shall postpone further discussion of this until we have introduced several operations on relations (see Section 2).

One of the major problems confronting the designer of a data system which is to support a relational model for its users is that of determining the class of stored representations to be supported. Ideally, the variety of permitted data representations should be just adequate to cover the spectrum of performance requirements of the total collection of installations. Too great a variety leads to unnecessary overhead in storage and continual reinterpretation of descriptions for the structures currently in effect.

For any selected class of stored representations the data system must provide a means of translating user requests expressed in the data language of the relational model into corresponding (and efficient) actions on the current stored representation. For a high level data language this presents a challenging design problem. Nevertheless, it is a problem which must be solved: as more users obtain concurrent access to a large data bank, responsibility for providing efficient response and throughput shifts from the individual user to the data system.


2.1. OPERATIONS ON RELATIONS

ment which gives rise to the plurality of joins. Such an element in the joining domain is called a point of ambiguity with respect to the joining of R with S.

If either $\pi_{21}(R)$ or S is a function, no point of ambiguity can occur in joining R with S. In such a case, the natural join of R with S is the only join of R with S. Note that the reiterated qualification "of R with S" is necessary, because S might be joinable with R (as well as R with S), and this join would be an entirely separate consideration. In Figure 5, none of the relations R, $\pi_{21}(R)$, S, $\pi_{21}(S)$ is a function.

Ambiguity in the joining of R with S can sometimes be resolved by means of other relations. Suppose we are given, or can derive from sources independent of R and S, a relation T on the domains project and supplier with the following properties:

(1) $\pi_1(T) = \pi_2(S)$,
(2) $\pi_2(T) = \pi_1(R)$,
(3) $T(j, s) \rightarrow \exists p\,(R(s, p) \wedge S(p, j))$,
(4) $R(s, p) \rightarrow \exists j\,(S(p, j) \wedge T(j, s))$,
(5) $S(p, j) \rightarrow \exists s\,(T(j, s) \wedge R(s, p))$,

then we may form a three-way join of R, S, T; that is, a ternary relation U such that

$\pi_{12}(U) = R$, $\pi_{23}(U) = S$, $\pi_{31}(U) = T$.

Such a join will be called a cyclic 3-join to distinguish it from a linear 3-join which would be a quaternary relation V such that

$\pi_{12}(V) = R$, $\pi_{23}(V) = S$, $\pi_{34}(V) = T$.

While it is possible for more than one cyclic 3-join to exist (see Figures 8, 9, for an example), the circumstances under which this can occur entail much more severe constraints than those for a plurality of 2-joins. To be specific, the relations R, S, T must possess points of ambiguity with respect to joining R with S (say point x), S with T (say y), and T with R (say z), and, furthermore, y must be a relative of x under S, z a relative of y under T, and x a relative of z under R. Note that in Figure 8 the points x = a; y = d; z = 2 have this property.

FIG. 8. Binary relations with a plurality of cyclic 3-joins

FIG. 9. Two cyclic 3-joins of the relations in Figure 8

The natural linear 3-join of three binary relations R, S, T is given by

$R * S * T = \{(a, b, c, d) : R(a, b) \wedge S(b, c) \wedge T(c, d)\}$

where parentheses are not needed on the left-hand side because the natural 2-join (*) is associative. To obtain the cyclic counterpart, we introduce the operator $\gamma$ which produces a relation of degree $n - 1$ from a relation of degree n by tying its ends together. Thus, if R is an n-ary relation ($n \ge 2$), the tie of R is defined by the equation

$\gamma(R) = \{(a_1, a_2, \ldots, a_{n-1}) : R(a_1, a_2, \ldots, a_{n-1}, a_n) \wedge a_1 = a_n\}$.

We may now represent the natural cyclic 3-join of R, S, T by the expression

$\gamma(R * S * T)$.

Extension of the notions of linear and cyclic 3-join and their natural counterparts to the joining of n binary relations (where $n \ge 3$) is obvious. A few words may be appropriate, however, regarding the joining of relations which are not necessarily binary. Consider the case of two relations R (degree r), S (degree s) which are to be joined on p of their domains (p < r, p < s). For simplicity, suppose these p domains are the last p of the r domains of R, and the first p of the s domains of S. If this were not so, we could always apply appropriate permutations to make it so. Now, take the Cartesian product of the first r-p domains of R, and call this new domain A. Take the Cartesian product of the last p domains of R, and call this B. Take the Cartesian product of the last s-p domains of S and call this C.

We can treat R as if it were a binary relation on the domains A, B. Similarly, we can treat S as if it were a binary relation on the domains B, C. The notions of linear and cyclic 3-join are now directly applicable. A similar approach can be taken with the linear and cyclic n-joins of n relations of assorted degrees.

2.1.4. Composition. The reader is probably familiar with the notion of composition applied to functions. We shall discuss a generalization of that concept and apply it first to binary relations. Our definitions of composition and composability are based very directly on the definitions of join and joinability given above.

Suppose we are given two relations R, S. T is a composition of R with S if there exists a join U of R with S such that $T = \pi_{13}(U)$. Thus, two relations are composable if and only if they are joinable. However, the existence of more than one join of R with S does not imply the existence of more than one composition of R with S.

Corresponding to the natural join of R with S is the natural composition of R with S defined by

$R \cdot S = \pi_{13}(R * S)$.

(Footnote fragment: ... the composition; see, for example, Kelley's "General Topology.")

Taking the relations R, S from Figure 5, their natural composition is exhibited in Figure 10 and another composition is exhibited in Figure 11 (derived from the join exhibited in Figure 7).

FIG. 10. The natural composition of R with S (from Figure 5)

FIG. 11. Another composition of R with S (from Figure 5)

When two or more joins exist, the number of distinct compositions may be as few as one or as many as the number of distinct joins. Figure 12 shows an example of two relations which have several joins but only one composition. Note that the ambiguity of point c is lost in composing R with S, because of unambiguous associations made via the points a, b, d.

Extension of composition to pairs of relations which are not necessarily binary (and which may be of different degrees) follows the same pattern as extension of pairwise joining to such relations.

A lack of understanding of relational composition has led several systems designers into what may be called the connection trap. This trap may be described in terms of the following example. Suppose each supplier description is linked by pointers to the descriptions of each part supplied by that supplier, and each part description is similarly linked to the descriptions of each project which uses that part. A conclusion is now drawn which is, in general, erroneous: namely that, if all possible paths are followed from a given supplier via the parts he supplies to the projects using those parts, one will obtain a valid set of all projects supplied by that supplier. Such a conclusion is correct only in the very special case that the target relation between projects and suppliers is, in fact, the natural composition of the other two relations, and we must normally add the phrase "for all time," because this is usually implied in claims concerning path-following techniques.
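The natural join, the natural composition, and the connection trap discussed above can be illustrated with a minimal sketch. It assumes binary relations are represented as Python sets of pairs; the function names and the supplier/part/project data are hypothetical and chosen only for illustration, not taken from the paper's figures.

```python
def natural_join(r, s):
    """Natural join R*S = {(a, b, c) : R(a, b) and S(b, c)} of two binary relations."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


def project(rel, *cols):
    """Projection onto the given 1-based column positions."""
    return {tuple(t[c - 1] for c in cols) for t in rel}


def natural_composition(r, s):
    """Natural composition R.S = pi_13(R*S)."""
    return project(natural_join(r, s), 1, 3)


def points_of_ambiguity(r, s):
    """Joining-domain elements with more than one relative under R and also under S."""
    ambiguous = set()
    for b in {y for (_, y) in r} & {y for (y, _) in s}:
        left = {a for (a, y) in r if y == b}
        right = {c for (y, c) in s if y == b}
        if len(left) > 1 and len(right) > 1:
            ambiguous.add(b)
    return ambiguous


# Hypothetical data: R(supplier, part) and S(part, project).
supplies = {("s1", "p1"), ("s2", "p1")}
uses = {("p1", "j1"), ("p1", "j2")}

# Assumed ground truth: s1 supplies p1 only to j1, and s2 supplies p1 only to j2.
actual = {("s1", "j1"), ("s2", "j2")}

# Following all supplier -> part -> project paths computes the natural composition.
followed = natural_composition(supplies, uses)

print(sorted(followed))                       # four supplier/project pairs
print(actual == followed)                     # False: the connection trap
print(points_of_ambiguity(supplies, uses))    # {'p1'}
```

In this sketch the assumed ground truth is itself a composition of R with S (it is the projection of the join {(s1, p1, j1), (s2, p1, j2)}), but not the natural one, so path following overstates which projects each supplier serves. The part p1 is the point of ambiguity responsible.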


2.1.5. Restriction. A subset of a relation is a relation. One way in which a relation S may act on a relation R to generate a subset of R is through the operation restriction of R by S. This operation is a generalization of the restriction of a function to a subset of its domain, and is defined as follows.

Let L, M be equal-length lists of indices such that $L = i_1, i_2, \ldots, i_k$, $M = j_1, j_2, \ldots, j_k$ where $k \le$ degree of R and $k \le$ degree of S. Then the L, M restriction of R by S, denoted $R_{L}|_{M}S$, is the maximal subset R' of R such that

$\pi_L(R') = \pi_M(S)$.

The operation is defined only if equality is applicable between elements of $\pi_{i_h}(R)$ on the one hand and $\pi_{j_h}(S)$ on the other for all $h = 1, 2, \ldots, k$.

The three relations R, S, R' of Figure 13 satisfy the equation $R' = R_{(2,3)}|_{(1,2)}S$.

FIG. 13. Example of restriction


We are now in a position to consider various applications of these operations on relations.
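As an illustration of the restriction operation, the following is a minimal sketch under stated assumptions. Relations are Python sets of tuples, index lists are 1-based, and "maximal subset" is read as keeping every tuple of R whose L-columns agree with the M-columns of some tuple of S. The function names and the sample data are hypothetical.

```python
def project(rel, cols):
    """Projection onto the given 1-based column positions."""
    return {tuple(t[c - 1] for c in cols) for t in rel}


def restrict(r, l_idx, m_idx, s):
    """L, M restriction of R by S (assumed reading of the definition).

    Keeps each tuple of R whose columns l_idx match the columns m_idx
    of at least one tuple of S.
    """
    if len(l_idx) != len(m_idx):
        raise ValueError("L and M must be equal-length index lists")
    allowed = project(s, m_idx)
    return {t for t in r if tuple(t[i - 1] for i in l_idx) in allowed}


# Illustrative data: R(s, p, j) and S(p, j).
r = {(1, "a", "A"), (2, "a", "A"), (2, "a", "B"), (2, "b", "A"), (2, "b", "B")}
s = {("a", "A"), ("c", "B"), ("b", "B")}

# Restrict R on its columns (2, 3) by columns (1, 2) of S.
r_prime = restrict(r, (2, 3), (1, 2), s)
print(sorted(r_prime))
```

Under this reading the result is always a subset of R, and its projection on L is contained in the projection of S on M; the two projections coincide only when every M-tuple of S is matched by some tuple of R.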
2.2. REDUNDANCY

Redundancy in the named set of relations must be distinguished from redundancy in the stored set of representations. We are primarily concerned here with the former. To begin with, we need a precise notion of derivability for relations.

Suppose $\theta$ is a collection of operations on relations and each operation has the property that from its operands it yields a unique relation (thus natural join is eligible, but join is not). A relation R is $\theta$-derivable from a set S of relations if there exists a sequence of operations from the collection $\theta$ which, for all time, yields R from members of S. The phrase "for all time" is present, because we are dealing with time-varying relations, and our interest is in derivability which holds over a significant period of time. For the named set of relationships in noninferential systems, it appears that an adequate collection $\theta_1$ contains the following operations: projection, natural join, tie, and restriction. Permutation is irrelevant and natural composition need not be included, because it is obtainable by taking a natural join and then a projection. For the stored set of representations, an adequate collection $\theta_2$ of operations would include permutation and additional operations concerned with subsetting and merging relations, and ordering and connecting their elements.
2.2.1. Strong Redundancy. A set of relations is strongly redundant if it contains at least one relation that possesses a projection which is derivable from other projections of relations in the set. The following two examples are intended to explain why strong redundancy is defined this way, and to demonstrate its practical use. In the first example the collection of relations consists of just the following relation:

employee (serial#, name, manager#, managername)

with serial# as the primary key and manager# as a foreign key. Let us denote the active domain by $\Delta_t$, and suppose that

$\Delta_t(\text{managername}) \subset \Delta_t(\text{name})$

for all time t. In this case the redundancy is obvious: the domain managername is unnecessary. To see that it is a strong redundancy as defined above, we observe that

$\pi_{34}(\text{employee}) = \pi_{12}(\text{employee})\,{}_{1}|_{1}\,\pi_{3}(\text{employee})$.
In the second example the collection of relations includes a relation S describing suppliers with primary key s#, a relation D describing departments with primary key d#, a relation J describing projects with primary key j#, and the following relations:

P(s#, d#, ...), Q(s#, j#, ...), R(d#, j#, ...),

where in each case ... denotes domains other than s#, d#, j#. Let us suppose the following condition C is known to hold independent of time: supplier s supplies department d (relation P) if and only if supplier s supplies some project j (relation Q) to which d is assigned (relation R). Then, we can write the equation

$\pi_{12}(P) = \pi_{12}(Q) \cdot \pi_{21}(R)$

and thereby exhibit a strong redundancy.
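The equation for the second example can be checked mechanically on a snapshot of P, Q, R. The sketch below is only an illustration: it assumes relations are Python sets of tuples, and the helper names and sample values are hypothetical.

```python
def project(rel, *cols):
    """Projection onto the given 1-based column positions."""
    return {tuple(t[c - 1] for c in cols) for t in rel}


def natural_composition(r, s):
    """Natural composition of binary relations: {(a, c) : R(a, b) and S(b, c)}."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}


def satisfies_condition_c(p, q, r):
    """True when pi_12(P) equals the natural composition of pi_12(Q) with pi_21(R)."""
    return project(p, 1, 2) == natural_composition(project(q, 1, 2), project(r, 2, 1))


# Hypothetical snapshot: Q(s#, j#) and R(d#, j#).
q = {("s1", "j1"), ("s2", "j2")}
r = {("d5", "j1"), ("d7", "j2")}

# P(s#, d#) as implied by condition C, and a variant with one unsupported pair.
p_ok = {("s1", "d5"), ("s2", "d7")}
p_bad = p_ok | {("s2", "d5")}

print(satisfies_condition_c(p_ok, q, r))   # True
print(satisfies_condition_c(p_bad, q, r))  # False
```

When the equation holds for all time, the first two columns of P carry no information beyond Q and R, which is what makes the redundancy strong.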


An important reason for the existence of strong redundancies in the named set of relationships is user convenience. A particular case of this is the retention of semiobsolete relationships in the named set so that old programs that refer to them by name can continue to run correctly. Knowledge of the existence of strong redundancies in the named set enables a system or data base administrator greater freedom in the selection of stored representations to cope more efficiently with current traffic. If the strong redundancies in the named set are directly reflected in strong redundancies in the stored set (or if other strong redundancies are introduced into the stored set), then, generally speaking, extra storage space and update time are consumed with a potential drop in query time for some queries and in load on the central processing units.
2.2.2. Weak Redundancy. A second type of redundancy may exist. In contrast to strong redundancy it is not characterized by an equation. A collection of relations is weakly redundant if it contains a relation that has a projection which is not derivable from other members but is at all times a projection of some join of other projections of relations in the collection.

We can exhibit a weak redundancy by taking the second example (cited above) for a strong redundancy, and assuming now that condition C does not hold at all times. The relations $\pi_{12}(P)$, $\pi_{12}(Q)$, $\pi_{12}(R)$ are complex relations with the possibility of points of ambiguity occurring from time to time in the potential joining of any two. Under these circumstances, none of them is derivable from the other two. However, constraints do exist between them. One of the weak redundancies can be characterized by the statement: for all time, $\pi_{12}(P)$ is some composition of $\pi_{12}(Q)$ with $\pi_{21}(R)$. The composition in question might be the natural one at some instant and a nonnatural one at another instant.

Generally speaking, weak redundancies are inherent in the logical needs of the community of users. They are not removable by the system or data base administrator. If they appear at all, they appear in both the named set and the stored set of representations.

2.3. CONSISTENCY

Whenever the named set of relations is redundant in either sense, we shall associate with that set a collection of statements which define all of the redundancies which hold independent of time between the member relations. If the information system lacks (and it most probably will) detailed semantic information about each named relation, it cannot deduce the redundancies applicable to the named set. It might, over a period of time, make attempts to induce the redundancies, but such attempts would be fallible.

Given a collection C of time-varying relations, an associated set Z of constraint statements and an instantaneous value V for C, we shall call the state (C, Z, V) consistent or inconsistent according as V does or does not satisfy Z. For example, given stored relations R, S, T together with the constraint statement "$\pi_{12}(T)$ is a composition of $\pi_{12}(R)$ with $\pi_{12}(S)$", we may check from time to time that the values stored for R, S, T satisfy this constraint. An algorithm for making this check would examine the first two columns of each of R, S, T (in whatever way they are represented in the system) and determine whether

(1) $\pi_1(T) = \pi_1(R)$,
(2) $\pi_2(T) = \pi_2(S)$,
(3) for every element pair (a, c) in the relation $\pi_{12}(T)$ there is an element b such that (a, b) is in $\pi_{12}(R)$ and (b, c) is in $\pi_{12}(S)$.

There are practical problems (which we shall not discuss here) in taking an instantaneous snapshot of a collection of relations, some of which may be very large and highly variable.

It is important to note that consistency as defined above is a property of the instantaneous state of a data bank, and is independent of how that state came about. Thus, in particular, there is no distinction made on the basis of whether a user generated an inconsistency due to an act of omission or an act of commission. Examination of a simple example will show the reasonableness of this (possibly unconventional) approach to consistency.

Suppose the named set C includes the relations S, J, D, P, Q, R of the example in Section 2.2 and that P, Q, R possess either the strong or weak redundancies described therein (in the particular case now under consideration, it does not matter which kind of redundancy occurs). Further, suppose that at some time t the data bank state is consistent and contains no project j such that supplier 2 supplies project j and j is assigned to department 5. Accordingly, there is no element (2, 5) in $\pi_{12}(P)$. Now, a user introduces the element (2, 5) into $\pi_{12}(P)$ by inserting some appropriate element into P. The data bank state is now inconsistent. The inconsistency could have arisen from an act of omission, if the input (2, 5) is correct, and there does exist a project j such that supplier 2 supplies j and j is assigned to department 5. In this case, it is very likely that the user intends in the near future to insert elements into Q and R which will have the effect of introducing (2, j) into $\pi_{12}(Q)$ and (5, j) in $\pi_{12}(R)$. On the other hand, the input (2, 5) might have been faulty. It could be the case that the user intended to insert some other element into P, an element whose insertion would transform a consistent state into a consistent state. The point is that the system will normally have no way of resolving this question without interrogating its environment (perhaps the user who created the inconsistency).

There are, of course, several possible ways in which a system can detect inconsistencies and respond to them. In one approach the system checks for possible inconsistency whenever an insertion, deletion, or key update occurs. Naturally, such checking will slow these operations down. If an inconsistency has been generated, details are logged internally, and if it is not remedied within some reasonable time interval, either the user or someone responsible for the security and integrity of the data is notified. Another approach is to conduct consistency checking as a batch operation once a day or less frequently. Inputs causing the inconsistencies which remain in the data bank state at checking time can be tracked down if the system maintains a journal of all state-changing transactions. This latter approach would certainly be superior if few nontransitory inconsistencies occurred.

2.4. SUMMARY

In Section 1 a relational model of data is proposed as a basis for protecting users of formatted data systems from the potentially disruptive changes in data representation caused by growth in the data bank and changes in traffic. A normal form for the time-varying collection of relationships is introduced.

In Section 2 operations on relations and two types of redundancy are defined and applied to the problem of maintaining the data in a consistent state. This is bound to become a serious practical problem as more and more different types of data are integrated together into common data banks.

Many questions are raised and left unanswered. For example, only a few of the more important properties of the data sublanguage in Section 1.4 are mentioned. Neither the purely linguistic details of such a language nor the implementation problems are discussed. Nevertheless, the material presented should be adequate for experienced systems programmers to visualize several approaches. It is also hoped that this paper can contribute to greater precision in work on formatted data systems.

Acknowledgment. It was C. T. Davies of IBM Poughkeepsie who convinced the author of the need for data independence in future information systems. The author wishes to thank him and also F. P. Palermo, C. P. Wang, E. B. Altman, and M. E. Senko of the IBM San Jose Research Laboratory for helpful discussions.

RECEIVED SEPTEMBER, 1969; REVISED FEBRUARY, 1970

REFERENCES
1. CHILDS, D. L. Feasibility of a set-theoretical data structure: a general structure based on a reconstituted definition of relation. Proc. IFIP Cong., 1968, North Holland Pub. Co., Amsterdam, p. 162-172.
2. LEVEIN, R. E., AND MARON, M. E. A computer system for inference execution and data retrieval. Comm. ACM 10, 11 (Nov. 1967), 715-721.
3. BACHMAN, C. W. Software for random access processing. Datamation (Apr. 1965), 36-41.
4. McGEE, W. C. Generalized file processing. In Annual Review in Automatic Programming 5, 13, Pergamon Press, New York, 1969, pp. 77-149.
5. Information Management System/360, Application Description Manual H20-0524-1. IBM Corp., White Plains, N.Y., July 1968.
6. GIS (Generalized Information System), Application Description Manual H20-0574. IBM Corp., White Plains, N.Y., 1965.
7. BLEIER, R. E. Treating hierarchical data structures in the SDC time-shared data management system (TDMS). Proc. ACM 22nd Nat. Conf., 1967, MDI Publications, Wayne, Pa., pp. 41-49.
8. IDS Reference Manual GE 625/635, GE Inform. Sys. Div., Pheonix, Ariz., CPB 1093B, Feb. 1968.
9. CHURCH, A. An Introduction to Mathematical Logic I. Princeton U. Press, Princeton, N.J., 1956.
10. FELDMAN, J. A., AND ROVNER, P. D. An Algol-based associative language. Stanford Artificial Intelligence Rep. AI-66, Aug. 1, 1968.
