Phyllis Baxendale
E.F. Codd
June, 1970
Volume 13, Number 6
pp. 377-387
Communications
of
the ACM
January, 1983
Volume 26
Number 1
The provision of data description tables in recently developed information systems represents a major advance toward the goal of data independence [5, 6, 7]. Such tables facilitate changing certain characteristics of the data representation stored in a data bank. However, the variety of data representation characteristics which can be changed without logically impairing some application programs is still quite limited.
[...] because all the well-known information systems that are marketed today fail to make a clear distinction between order of presentation on the one hand and stored ordering on the other. Significant implementation problems must be solved to provide this kind of independence.
1.2.2. Indexing Dependence. In the context of formatted data, an index is usually thought of as a purely performance-oriented component of the data representation. It tends to improve response to queries and updates and, at the same time, slow down response to insertions and deletions. From an informational standpoint, an index is a redundant component of the data representation. If a system uses indices at all and if it is to perform well in an environment with changing patterns of activity on the data bank, an ability to create and destroy indices from time to time will probably be necessary. The question then arises: Can application programs and terminal activities remain invariant as indices come and go?
Present formatted data systems take widely different approaches to indexing. TDMS [7] unconditionally provides indexing on all attributes. The presently released version of IMS [5] provides the user with a choice for each file: a choice between no indexing at all (the hierarchic sequential organization) or indexing on the primary key only (the hierarchic indexed sequential organization). In neither case is the user's application logic dependent on the existence of the unconditionally provided indices. IDS [8], however, permits the file designers to select attributes to be indexed and to incorporate indices into the file structure by means of additional chains. Application programs taking advantage of the performance benefit of these indexing chains must refer to those chains by name. Such programs do not operate correctly if these chains are later removed.
1.2.3. Access Path Dependence. Many of the existing formatted data systems provide users with tree-structured files or slightly more general network models of the data. Application programs developed to work with these systems tend to be logically impaired if the trees or networks are changed in structure. A simple example follows.
Suppose the data bank contains information about parts and projects. For each part, the part number, part name, part description, quantity-on-hand, and quantity-on-order are recorded. For each project, the project number, project name, project description are recorded. Whenever a project makes use of a certain part, the quantity of that part committed to the given project is also recorded. Suppose that the system requires the user or file designer to declare or define the data in terms of tree structures. Then, any one of the hierarchical structures may be adopted for the information mentioned above (see Structures 1-5).
Since, in general, it is not practical to develop application programs which test for all tree structurings permitted by the system, these programs fail when a change in structure becomes necessary.
Systems which provide users with a network model of the data run into similar difficulties. In both the tree and network cases, the user (or his program) is required to exploit a collection of user access paths to the data. It does not matter whether these paths are in close correspondence with pointer-defined paths in the stored representation (in IDS the correspondence is extremely simple; in TDMS it is just the opposite). The consequence, regardless of the stored representation, is that terminal activities and programs become dependent on the continued existence of the user access paths.
[Structures 1-5: alternative tree structures over the segments PART (part #, part name, part description, quantity-on-hand, quantity-on-order) and PROJECT (project #, project name, project description), with parts subordinate to projects, projects subordinate to parts, or the two as peers.]
(5) The significance of each column is partially conveyed by labeling it with the name of the corresponding domain.
The example in Figure 1 illustrates a relation of degree 4, called supply, which reflects the shipments-in-progress of parts from specified suppliers to specified projects in specified quantities.
supply (supplier, part, project, quantity)
FIG. 1. A relation of degree 4
A RELATIONAL VIEW OF DATA
The term relation is used here in its accepted mathematical sense. Given sets $S_1, S_2, \ldots, S_n$ (not necessarily distinct), $R$ is a relation on these $n$ sets if it is a set of n-tuples each of which has its first element from $S_1$, its second element from $S_2$, and so on. We shall refer to $S_j$ as the jth domain of $R$. As defined above, $R$ is said to have degree $n$. Relations of degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree n n-ary.
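As a minimal sketch of this definition, a relation can be modeled as a Python set of tuples and checked against its domains. The domains `S1`, `S2`, `S3` and the helper name `is_relation_on` are illustrative assumptions, not names from the paper.

```python
from itertools import product

# Hypothetical domains, chosen only for illustration.
S1 = {1, 2}
S2 = {"bolt", "nut"}
S3 = {5, 7}
domains = [S1, S2, S3]

def is_relation_on(r, domains):
    """True if r is a set of n-tuples whose jth element comes from the jth domain."""
    n = len(domains)
    return all(
        len(t) == n and all(t[j] in domains[j] for j in range(n))
        for t in r
    )

R = {(1, "bolt", 5), (2, "nut", 7)}   # a relation of degree 3 on S1, S2, S3
degree = len(domains)

# Equivalently, R is a subset of the Cartesian product of its domains.
cartesian = set(product(S1, S2, S3))
```

Because `R` is a set, duplicate rows cannot occur and row order carries no meaning, which matches the mathematical sense of the term.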
Normalization proceeds as follows. Starting with the relation at the top of the tree, take its primary key and expand each of the immediately subordinate relations by inserting this primary key domain or domain combination. The primary key of each expanded relation consists of the primary key before expansion augmented by the primary key copied down from the parent relation. Now, strike out from the parent relation all nonsimple domains, remove the top node of the tree, and repeat the same sequence of operations on each remaining subtree.
The result of normalizing the collection of relations in Figure 3(a) is the collection in Figure 3(b). The primary key of each relation is italicized to show how such keys are expanded by the normalization.
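The normalization procedure can be sketched in Python under some simplifying assumptions: a relation is a list of dicts, and any attribute whose value is a list is treated as a nonsimple domain (a nested relation). The function name `normalize`, the `keys` mapping, and the sample data are hypothetical and chosen only to illustrate the key-copying step.

```python
def normalize(name, rows, keys, inherited=None, out=None):
    """Flatten a tree of nested relations into a collection of flat relations.

    keys maps each relation name to its primary-key attributes before expansion.
    inherited holds the expanded primary key copied down from the parent.
    """
    inherited = inherited or {}
    out = {} if out is None else out
    for row in rows:
        # Strike out nonsimple domains, keeping only simple attributes.
        simple = {a: v for a, v in row.items() if not isinstance(v, list)}
        out.setdefault(name, []).append({**inherited, **simple})
        # Expanded key = key copied from the parent + this relation's own key.
        expanded_key = {**inherited, **{k: row[k] for k in keys[name]}}
        for attr, value in row.items():
            if isinstance(value, list):
                normalize(attr, value, keys, expanded_key, out)
    return out

# Illustrative unnormalized data (assumed, not taken from Figure 3).
employee = [{
    "man#": 1, "name": "Jones",
    "jobhistory": [{
        "jobdate": 1968, "title": "clerk",
        "salaryhistory": [{"salarydate": 1968, "salary": 500},
                          {"salarydate": 1969, "salary": 550}],
    }],
    "children": [{"childname": "Ann", "birthyear": 1960}],
}]
keys = {"employee": ("man#",), "jobhistory": ("jobdate",),
        "salaryhistory": ("salarydate",), "children": ("childname",)}

flat = normalize("employee", employee, keys)
```

Each flat row in `flat["salaryhistory"]` carries `man#` and `jobdate` in addition to its own `salarydate`, which is the key expansion the text describes.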
2. Redundancy and Consistency
Since relations are sets, all of the usual set operations are applicable to them. Nevertheless, the result may not be a relation; for example, the union of a binary relation and a ternary relation is not a relation.
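A small Python illustration of this point, using made-up tuples: the set union is well defined, but its members no longer share a single degree.

```python
# Illustrative relations (values are arbitrary).
binary = {(1, 2), (2, 3)}
ternary = {(1, 2, 5)}

union = binary | ternary                 # a perfectly good set...
arities = {len(t) for t in union}        # ...whose tuples have mixed lengths
is_relation = len(arities) == 1          # a relation needs one common degree
```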
The operations discussed below are specifically for relations. These operations are introduced because of their key role in deriving relations from other relations. Their principal application is in noninferential information systems (systems which do not provide logical inference services), although their applicability is not necessarily destroyed when such services are added.
Most users would not be directly concerned with these operations. Information systems designers and people concerned with data bank control should, however, be thoroughly familiar with them.
Note that the join shown in Figure 6 is the natural join of R with S from Figure 5. Another join is shown in Figure 7.
2.1.1. Permutation. A binary relation has an array representation with two columns. Interchanging these columns yields the converse relation. More generally, if a permutation is applied to the columns of an n-ary relation, the resulting relation is said to be a permutation of the given relation. There are, for example, 4! = 24 permutations of the relation supply in Figure 1, if we include the identity permutation which leaves the ordering of columns unchanged.
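The counting argument can be checked with a short Python sketch. The helper `permute` and the sample binary relation are assumptions for illustration; column numbers are 1-based to match the text.

```python
from itertools import permutations

def permute(relation, order):
    """Reorder the columns of a relation; order lists 1-based column numbers."""
    return {tuple(row[i - 1] for i in order) for row in relation}

binary = {(1, 2), (2, 3)}                # illustrative binary relation
converse = permute(binary, (2, 1))       # interchange the two columns

# All column orderings of a degree-4 relation, identity included.
column_orders = list(permutations(range(1, 5)))
```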
Since the user's relational model consists of a collection of relationships (domain-unordered relations), permutation is not relevant to such a model considered in isolation. It is, however, relevant to the consideration of stored representations of the model. In a system which provides symmetric exploitation of relations, the set of queries answerable by a stored relation is identical to the set answerable by any permutation of that relation. Although it is logically unnecessary to store both a relation and some permutation of it, performance considerations could make it advisable.
FIG. 4. A permuted projection of the relation in Figure 1
2.1.2. Projection. Suppose now we select certain columns of a relation (striking out the others) and then remove from the resulting array any duplication in the rows. The final array represents a relation which is said to be a projection of the given relation.
A selection operator $\pi$ is used to obtain any desired permutation, projection, or combination of the two operations. Thus, if $L$ is a list of $k$ indices $L = i_1, i_2, \ldots, i_k$ and $R$ is an n-ary relation ($n \geq k$), then $\pi_L(R)$ is the k-ary relation whose jth column is column $i_j$ of $R$ ($j = 1, 2, \ldots, k$) except that duplication in resulting rows is removed. Consider the relation supply of Figure 1. A permuted projection of this relation is exhibited in Figure 4. Note that, in this particular case, the projection has fewer n-tuples than the relation from which it is derived.
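A minimal Python sketch of the operator follows. The function name `project` is an assumption, and the rows of `supply` are illustrative values, since the rows of Figure 1 are not reproduced in this text. Using a set comprehension removes duplicate rows automatically.

```python
def project(relation, columns):
    """pi_L(R): keep the listed 1-based columns, in the listed order, without duplicates."""
    return {tuple(row[i - 1] for i in columns) for row in relation}

# Illustrative rows for supply (supplier, part, project, quantity).
supply = {
    (1, 2, 5, 17),
    (1, 3, 5, 23),
    (2, 3, 7, 9),
    (2, 7, 5, 4),
    (4, 1, 1, 12),
}

# A permuted projection: column 3 (project) followed by column 1 (supplier).
projected = project(supply, (3, 1))
```

With these rows, two tuples of `supply` collapse onto the same (project, supplier) pair, so the projection has fewer tuples than the relation it was derived from.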
FIG. 7. Another join of R with S (from Figure 5)
2.1.3. Join. Suppose we are given two binary relations, which have some domain in common. Under what circumstances can we combine these relations to form a [...]
Inspection of these relations reveals an element (element 1) of the domain part (the domain on which the join is to be made) with the property that it possesses more than one relative under R and also under S. It is this ele-
Associated with a data bank are two collections of relations: the named set and the expressible set. The named set is the collection of all those relations that the community of users can identify by means of a simple name (or identifier). A relation R acquires membership in the named set when a suitably authorized user declares R; it loses membership when a suitably authorized user cancels the declaration of R.
The expressible set is the total collection of relations that can be designated by expressions in the data language. Such expressions are constructed from simple names of relations in the named set; names of generations, roles and domains; logical connectives; the quantifiers of the predicate calculus; and certain constant relation symbols such as =, >. The named set is a subset of the expressible set, usually a very small subset.
Since some relations in the named set may be time-independent combinations of others in that set, it is useful to consider associating with the named set a collection of statements that define these time-independent constraints. We shall postpone further discussion of this until we have introduced several operations on relations (see Section 2).
One of the major problems confronting the designer of a data system which is to support a relational model for its users is that of determining the class of stored representations to be supported. Ideally, the variety of permitted data representations should be just adequate to cover the spectrum of performance requirements of the total collection of installations. Too great a variety leads to unnecessary overhead in storage and continual reinterpretation of descriptions for the structures currently in effect.
For any selected class of stored representations the data system must provide a means of translating user requests expressed in the data language of the relational model into corresponding (and efficient) actions on the current stored representation. For a high level data language this presents a challenging design problem. Nevertheless, it is a problem which must be solved: as more users obtain concurrent access to a large data bank, responsibility for providing efficient response and throughput shifts from the individual user to the data system.
Arithmetic functions may be needed in the qualification or other parts of retrieval statements. Such functions can be defined in H and invoked in R.
A set so specified may be fetched for query purposes only, or it may be held for possible changes. Insertions take the form of adding new elements to declared relations without regard to any ordering that may be present in their machine representation. Deletions which are effective for the community (as opposed to the individual user or subcommunities) take the form of removing elements from declared relations. Some deletions and updates may be triggered by others, if deletion and update dependencies between specified relations are declared in R.
One important effect that the view adopted toward data has on the language used to retrieve it is in the naming of data elements and sets. Some aspects of this have been discussed in the previous section. With the usual network view, users will often be burdened with coining and using more relation names than are absolutely necessary, since names are associated with paths (or path types) rather than with relations.
Once a user is aware that a certain relation is stored, he will expect to be able to exploit it using any combination of its arguments as "knowns" and the remaining arguments as "unknowns," because the information (like Everest) is there. This is a system feature (missing from many current information systems) which we shall call (logically) symmetric exploitation of relations. Naturally, symmetry in performance is not to be expected.
To support symmetric exploitation of a single binary relation, two directed paths are needed. For a relation of degree n, the number of paths to be named and controlled is n factorial.
Again, if a relational view is adopted in which every n-ary relation (n > 2) has to be expressed by the user as a nested expression involving only binary relations (see Feldman's LEAP System [10], for example) then 2n - 1 names have to be coined instead of only n + 1 with direct n-ary notation as described in Section 1.2. For example, the [...]
2.1. OPERATIONS ON RELATIONS
ment which gives rise to the plurality of joins. Such an element in the joining domain is called a point of ambiguity with respect to the joining of R with S.
If either $\pi_{21}(R)$ or $S$ is a function, no point of ambiguity can occur in joining R with S. In such a case, the natural join of R with S is the only join of R with S. Note that the reiterated qualification "of R with S" is necessary, because S might be joinable with R (as well as R with S), and this join would be an entirely separate consideration. In Figure 5, none of the relations $R$, $\pi_{21}(R)$, $S$, $\pi_{21}(S)$ is a function.
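The notions of natural join, function, and point of ambiguity can be sketched in Python as follows. The natural join is taken here in its usual sense, the set of triples (a, b, c) with (a, b) in R and (b, c) in S. The helper names and the values of `R` and `S` are illustrative assumptions, chosen so that element 1 of the joining domain behaves as described in the text.

```python
def natural_join(r, s):
    """All (a, b, c) such that (a, b) is in r and (b, c) is in s."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

def converse(rel):
    """pi_21 of a binary relation: interchange its two columns."""
    return {(b, a) for (a, b) in rel}

def is_function(rel):
    """True if no first element is paired with two different second elements."""
    firsts = [a for (a, _) in rel]
    return len(firsts) == len(set(firsts))

def points_of_ambiguity(r, s):
    """Joining-domain elements with more than one relative under r and under s."""
    points = set()
    for b in {b for (_, b) in r}:
        under_r = {a for (a, b2) in r if b2 == b}
        under_s = {c for (b2, c) in s if b2 == b}
        if len(under_r) > 1 and len(under_s) > 1:
            points.add(b)
    return points

R = {(1, 1), (2, 1), (2, 2)}   # illustrative (supplier, part) pairs
S = {(1, 1), (1, 2), (2, 1)}   # illustrative (part, project) pairs
S_functional = {(1, 1), (2, 1)}  # each part assigned to exactly one project
```

With `S_functional` in place of `S`, every part has a single relative under S, so `points_of_ambiguity` returns the empty set, in line with the statement about functions above.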
Ambiguity in the joining of R with S can sometimes be resolved by means of other relations. Suppose we are given, or can derive from sources independent of R and S, a relation T on the domains project and supplier with the following properties:
(1) $\pi_1(T) = \pi_2(S)$,
(2) $\pi_2(T) = \pi_1(R)$,
(3) $T(j, s) \rightarrow \exists p\,(R(s, p) \land S(p, j))$,
(4) $R(s, p) \rightarrow \exists j\,(S(p, j) \land T(j, s))$,
(5) $S(p, j) \rightarrow \exists s\,(T(j, s) \land R(s, p))$;
then we may form a three-way join of R, S, T, that is, a ternary relation U such that
$\pi_{12}(U) = R$, $\pi_{23}(U) = S$, $\pi_{31}(U) = T$.
[...] than those for a plurality of 2-joins. To be specific, the relations R, S, T must possess points of ambiguity with respect to joining R with S (say point x), S with T (say [...]
$R * S * T = \{(a, b, c, d) : R(a, b) \land S(b, c) \land T(c, d)\}$
where parentheses are not needed on the left-hand side because the natural 2-join (*) is associative. To obtain the cyclic counterpart, we introduce the operator $\gamma$ which produces a relation of degree n - 1 from a relation of degree n by tying its ends together. Thus, if R is an n-ary relation ($n \geq 2$), the tie of R is defined by the equation
$\gamma(R) = \{(a_1, a_2, \ldots, a_{n-1}) : R(a_1, a_2, \ldots, a_{n-1}, a_n) \land a_1 = a_n\}$.
We may now represent the natural cyclic 3-join of R, S, T by the expression $\gamma(R * S * T)$.
The natural composition of R with S is defined by
$R \cdot S = \pi_{13}(R * S)$.
Taking the relations R, S from Figure 5, their natural composition is exhibited in Figure 10 and another composition is exhibited in Figure 11 (derived from the join exhibited in Figure 7).
FIG. 11. Another composition of R with S (from Figure 5)
When two or more joins exist, the number of distinct compositions may be as few as one or as many as the number of distinct joins. Figure 12 shows an example of two relations which have several joins but only one composition. Note that the ambiguity of point c is lost in composing R with S, because of unambiguous associations made via the points a, b, d, e.
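The relationship between joins and compositions can be sketched in Python. The helper names are assumptions, and `R`, `S`, and the second join `U` are illustrative values rather than the contents of the figures. The sketch shows that two different joins of the same pair of relations can yield two different compositions.

```python
def natural_join(r, s):
    """All (a, b, c) such that (a, b) is in r and (b, c) is in s."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

def is_join(u, r, s):
    """u is a join of r with s if pi_12(u) == r and pi_23(u) == s."""
    return ({(a, b) for (a, b, _) in u} == r
            and {(b, c) for (_, b, c) in u} == s)

def compose_via(u):
    """pi_13 of a join: drop the middle (joining) column."""
    return {(a, c) for (a, _, c) in u}

def natural_composition(r, s):
    """pi_13 of the natural join of r with s."""
    return compose_via(natural_join(r, s))

R = {(1, 1), (2, 1), (2, 2)}             # illustrative (supplier, part)
S = {(1, 1), (1, 2), (2, 1)}             # illustrative (part, project)
U = {(1, 1, 2), (2, 1, 1), (2, 2, 1)}    # another join of R with S
```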
$\Delta_t(\mathit{managername}) \subset \Delta_t(\mathit{name})$
for all time t. In this case the redundancy is obvious: the domain managername is unnecessary. To see that it is a strong redundancy as defined above, we observe that
$\pi_{34}(\mathit{employee}) = \pi_{12}(\mathit{employee})\;{}_{1}|_{1}\;\pi_{3}(\mathit{employee})$.
In the second example the collection of relations includes a relation S describing suppliers with primary key s#, a relation D describing departments with primary key d#, a relation J describing projects with primary key j#, and the following relations:
P(s#, d#, ...), Q(s#, j#, ...), R(d#, j#, ...),
An example will show the reasonableness of this (possibly unconventional) approach to consistency.
Suppose the named set C includes the relations S, J, D, P, Q, R of the example in Section 2.2 and that P, Q, R possess either the strong or weak redundancies described therein (in the particular case now under consideration, it does not matter which kind of redundancy occurs). Further, suppose that at some time t the data bank state is consistent and contains no project j such that supplier 2 supplies project j and j is assigned to department 5. Accordingly, there is no element (2, 5) in $\pi_{12}(P)$. Now, a user introduces the element (2, 5) into $\pi_{12}(P)$ by inserting some appropriate element into P. The data bank state is now inconsistent.
The inconsistency could have arisen from an act of omission, if the input (2, 5) is correct, and there does exist a project j such that supplier 2 supplies j and j is assigned to department 5. In this case, it is very likely that the user intends in the near future to insert elements into Q and R which will have the effect of introducing (2, j) into $\pi_{12}(Q)$ and (5, j) into $\pi_{12}(R)$. On the other hand, the input (2, 5) might have been faulty. It could be the case that the user intended to insert some other element into P, an element whose insertion would transform a consistent state into a consistent state. The point is that the system will normally have no way of resolving this question without interrogating its environment (perhaps the user who created the inconsistency).
Given a collection C of time-varying relations, an associated set Z of constraint statements and an instantaneous value V for C, we shall call the state (C, Z, V) consistent or inconsistent according as V does or does not satisfy Z. For example, given stored relations R, S, T together with the constraint statement "$\pi_{12}(T)$ is a composition of $\pi_{12}(R)$ with $\pi_{12}(S)$", we may check from time to time that the values stored for R, S, T satisfy this constraint. An algorithm for making this check would examine the first two columns of each of R, S, T (in whatever way they are represented in the system) and determine whether [...]
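One way to sketch such a consistency check in Python is shown below. This is an illustrative approach, not necessarily the algorithm the text goes on to describe: it builds the largest candidate join whose outer columns are permitted by T and then tests the three projections. The reasoning is that every join of R with S is contained in the natural join, and projections only grow as tuples are added, so if any suitable join exists the largest candidate is one. The function names and sample stored relations are assumptions.

```python
def first_two(stored):
    """pi_12 of a stored relation of degree >= 2."""
    return {(row[0], row[1]) for row in stored}

def is_composition(t, r, s):
    """True if some join U of r with s satisfies pi_13(U) == t."""
    # Largest candidate: natural-join triples whose (a, c) pair occurs in t.
    u = {(a, b, c) for (a, b) in r for (b2, c) in s
         if b == b2 and (a, c) in t}
    return ({(a, b) for (a, b, _) in u} == r
            and {(b, c) for (_, b, c) in u} == s
            and {(a, c) for (a, _, c) in u} == t)

def state_is_consistent(r_stored, s_stored, t_stored):
    """Check the constraint on the first two columns of each stored relation."""
    return is_composition(first_two(t_stored),
                          first_two(r_stored),
                          first_two(s_stored))

# Illustrative stored relations; the third column stands for other attributes.
R_stored = {(1, 1, "x"), (2, 1, "y"), (2, 2, "z")}
S_stored = {(1, 1, "x"), (1, 2, "y"), (2, 1, "z")}
T_good = {(1, 2, "x"), (2, 1, "y")}
T_bad = {(1, 1, "x"), (2, 2, "y")}
```

As the surrounding discussion notes, a failed check only signals an inconsistent state; deciding whether the cause is an omission or a faulty input still requires asking the environment.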
2.4. SUMMARY