Anda di halaman 1dari 161

Articles in the Time and Time Again Series.

Time and Time Again: Managing Time in Relational Databases, Part 1


Tom Johnston, Randall Weis
Article published in May 2007, in DM Review magazine

Time and Time Again: Managing Time in Relational Databases, Part 2 -


Preliminaries
Tom Johnston, Randall Weis
Article published in May 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 3 -


Version Patterns 1 and 2
Tom Johnston, Randall Weis
Article published in May 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 4 -


Preliminaries to Version Pattern 3
Tom Johnston, Randall Weis
Article published in June 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 5:


Version Pattern 3
Tom Johnston, Randall Weis
Article published in June 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 6:


Version Pattern 4
Tom Johnston, Randall Weis
Article published in June 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 7:


Completing Version Pattern 4
Tom Johnston, Randall Weis
Article published in July 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 8:


Version Pattern 5
Tom Johnston, Randall Weis
Article published in August 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 9:


Completing Version Pattern 5
Tom Johnston, Randall Weis
Article published in August 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 10 -


Foreign Keys in Version Tables
Tom Johnston, Randall Weis
Article published in September 2007, in DM Direct

Time and Time again: Managing Time in Relational Databases, Part 11 -


Foreign Keys in Versioned tables Continued
Tom Johnston, Randall Weis
Article published in October 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 12:
Foreign Keys in Version Tables
Tom Johnston, Randall Weis
Article published in October 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 13 - A


Roadmap of the Rest of This Series
Tom Johnston, Randall Weis
Article published in November 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 14 -


Versioning with Surrogate Keys
Randall Weis, Tom Johnston
Article published in November 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 15 - For
and Against Surrogate Keys
Tom Johnston, Randall Weis
Article published in December 2007, in DM Direct

Time and Time Again: Managing Time in Relational Databases Part 16 -


Physical Implementation of Version Pattern 1
Randall Weis, Tom Johnston
Article published in the December 21 2007 issue of DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 17 -


Physical Implementation of Version Pattern 2
Randall Weis, Tom Johnston
Article published in the January 4, 2008 issue of DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 18 -


Physical Implementation of Version Pattern 3
Randall Weis, Tom Johnston
Article published in the January 18, 2008 issue of DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 19 - RI


and Temporal RI
Tom Johnston, Randall Weis
Article published in the February 1, 2008 issue of DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 20 - RI


and Temporal RI (Continued)
Tom Johnston, Randall Weis
Article published in the February 15, 2008 issue of DM Direct

Time and Time Again: Managing Time in Relational Databases, Part 21 -


Enforcement Time Options for Temporal RI
Tom Johnston, Randall Weis
Article published in the March 7, 2008 issue of DM Direct.

Time and Time Again: Managing Time in Relational Databases - Part 22:
Original and Temporal Deletes
Tom Johnston, Randall Weis
Article published in the March 21, 2008 issue of DM Direct.

Time and Time Again: Managing Time in Relational Databases, Part 23 -


Original and Temporal Updates
Tom Johnston, Randall Weis
Article published in the April 4, 2008 issue of DM Direct.

Time and Time Again - Managing Time in Relational Databases, Part 24:
Original and Temporal Updates (Concluded)
Tom Johnston, Randall Weis
Article published in the April 18, 2008 issue of DM Direct.

Time and Time Again: Managing Time in Relational Databases, Part 25 -


Original and Temporal Inserts
Tom Johnston, Randall Weis
Article published in the May 2, 2008 issue of DM Direct.

Time and Time Again - Managing Time in Relational Databases, Part 26:
Points in Time and Periods of Time
Tom Johnston, Randall Weis
Article published in the May 16, 2008 issue of DM Direct.
Time and Time Again: Managing Time in Relational Databases, Part 1 http://www.dmreview.com/issues/20070501/1082475-1.html?type=printe...

Time and Time Again: Managing Time in


Relational Databases, Part 1
Tom Johnston, Randall Weis
DM Review Magazine, May 2007

Relational databases usually keep data about the current state of things - about how things are
now. Updates do occur, of course; but when they do, they overwrite the data already there. In
doing so, history is lost. We can't tell how things used to be. To tell how things used to be, as well
as how they are now requires that we introduce time into our databases.

This series of articles on a very technical topic is motivated by the fact that increasingly,
businesses need real-time access to historical data. No longer can all historical data be pushed off
onto archive files or onto less immediately accessible historical databases. The objective of this
series of articles is to show how to provide real-time access to history in particular, and to any kind
of noncurrent data in general, and to show how to do so using the relational database
management systems (DBMSs) and SQL that are available today.

First, a note on terminology. It will often be convenient to talk about history, because that is the
kind of noncurrent data that businesses are primarily concerned with. But the real topic is the
representation of time in relational databases, whether past, present or future time, whether
real-world or database time, whether actual or hypothetical time, whether linear or branching time.

Join the Largest Community of MDM Experts in New York City,


October 19-21, 2008
Restructured to better suit your business needs, MDM Summit Fall 2008
offers value of experience and thought leadership. Speakers will discuss
hierarchy management, identity resolution and more. Pre-register by
August 8 for bonus savings!

Managing time in today's relational databases is difficult. With more than 50 years of combined
experience in data and database management, we have encountered no database model more
complex for its size than the model we will propose, no queries fraught with more possibilities for
misstatement and misinterpretation than queries against databases that attempt to manage time.
But we know of no simpler way to meet business requirements for real-time access to what we will
call "versioned history."

However, very little support for time management has yet been provided by DBMS vendors or is
specified in the current SQL standards. The complexity of time management cannot be hidden, as
it ultimately must be, inside the DBMS itself, with declarative interfaces provided in extensions to
the SQL DDL and DML languages.

But businesses need robust and reliable time management in their databases now, not eventually.
So we take a do-it-yourself approach. And the first thing we must do ourselves is find a way to
encapsulate both time management data structures and the additional complexities in SQL
statements that are required for time management data access.

As for updating temporal data structures, all updates should be mediated by application program

1 of 4 10/7/2008 9:23 AM
Time and Time Again: Managing Time in Relational Databases, Part 1 http://www.dmreview.com/issues/20070501/1082475-1.html?type=printe...

interfaces (APIs). If an operational data store (ODS) or data warehouse is the database being
updated, the API function will be provided by the extract, transform and load (ETL) process,
whether hand coded or implemented with a commercial ETL tool.

For retrievals against temporal data structures, all queries should access a set of view tables from
which as much of the technical detail of managing time will have been removed as possible.
These tables will satisfy the following two requirements:

When point-in-time data is wanted, these view tables will look, column for column, like tables
that are not temporally managed, i.e., like "typical" or "normal" tables in a database.
When that desired point in time is now, these view tables will guarantee that existing
retrievals will remain completely unaffected as time management is added to the tables they
access.

We should think of these temporal tables as no more accessible to the end user, DBA or
developer than are the dynamic balanced-tree access method structures that provide the file
storage subsystem for many commercial DBMSs. The only database users able to directly
manipulate these temporal tables should be:

The developers who translate source system or end-user-supplied update transactions into
updates to these temporal tables, and
The developers who write the CREATE VIEW statements that encapsulate these tables for
retrieval transactions.

Figure 1 illustrates the encapsulation necessary to hide the complexities of temporal database
structures from both those who update those structures and those who retrieve data from them.

Figure 1: Encapsulating Temporal Tables

When we IT professionals talk about history in databases, we may have several different things in
mind. A taxonomy of the types of history that IT data management professionals talk about is
presented in Figure 2.

2 of 4 10/7/2008 9:23 AM
Time and Time Again: Managing Time in Relational Databases, Part 1 http://www.dmreview.com/issues/20070501/1082475-1.html?type=printe...

Figure 2: A Taxonomy of History as Managed in RDBMSs

Database history is historical data kept in relational database tables. The two major categories of
historical data are reconstructable history and queryable history.

Reconstructable history is data about the past state of something, obtained by restoring a
backup file and then applying update transactions captured in a DBMS logfile, from that point in
time forward to the point in time of interest. This requires the intervention of IT personnel and is
therefore not real time.

Queryable history is data about the past state of something, obtained with a SQL query, without
the need for restoring backups and reapplying logged updates. Because the query may be
directly executed, this type of history is possibly real time. It is actually real time if its performance
meets the parameters which define real time for the class of query it belongs to. Queryable history
may be kept as either a record of events or as a record of states.

Event history is data about events that have altered the states of things. This data is captured as
transactions in tables that are best queried with links to the leaf-level entries of a set of
dimensional hierarchies. Thus, these tables are often queried as the fact tables in star schema
data marts, usually by means of OLAP reporting tools.

State history is historical data, captured as copies of the current state of an object, either on a
periodic basis or in response to a specific update event. State history may be kept as either
snapshots or versions.

Snapshot history is a coordinated set of copies in a relational database. Snapshots are often
taken at the entire database level, but sometimes at the level of a semantically related subset of
tables. Snapshots occur at regularly scheduled intervals. Consequently, snapshots will miss any
updates that are overlaid by later updates to the same data that happen before the next snapshot
is taken. Another drawback to snapshots is that they are an inefficient way to record history,
because they create copies of rows whether or not those rows have changed.

Version history is individual row-level logical updates, implemented without overlaying data, by
"retiring" the current version of the object and replacing it with a new version that contains the
updated data. Versions are created on an as-needed basis, i.e., every time an update important
enough to be versioned occurs. Consequently, versions will not miss any updates. They are also
an efficient way to record history because they create new rows only when a versionable change
has occurred.

Thus, for example, suppose that the current version of customer 123's name is Mary Smith. When

3 of 4 10/7/2008 9:23 AM
Time and Time Again: Managing Time in Relational Databases, Part 1 http://www.dmreview.com/issues/20070501/1082475-1.html?type=printe...

she changes her name to Mary Jones, the Mary Smith version of that customer loses its status as
the current version. The Mary Jones version is inserted and marked as the new current version.
Effectivity dates indicate when each version was the truth about Mary, as the business knew it.

Over the past two decades, many suggestions have been made as to how vendors should
implement time management in relational databases, and how SQL DML and DDL should be
modified to provide declarative access to those new capabilities. The best access to pre-2000
computer science research is written by the principal computer scientist working in the field of
temporal databases.1 A well-known commentator on relational databases and a member of the
SQL standards committee have also written on this topic.2

But our focus is not on recommendations to vendors or to the SQL standards committees. Our
focus is on the here and now. Given today's relational DBMSs, and the SQL common to them,
how can we provide the temporal support that businesses increasingly need, and that they need
now?

This series of articles will continue in the DM Direct newsletter and will be published every other
issue beginning May 4. The discussion will be at the implementation level of detail, although to
emphasize its relevance, we will present this very technical material in terms of a real-world
business application. In these articles, academic research will be brought to bear on the problem
of providing temporal data management in today's relational databases. However, our emphasis is
more on real-world implementation than on theory. These articles will address the problems and
issues faced by business IT data architects, data modelers, DBAs, those developers who write the
ETL or other interface logic to update temporal tables and those developers who write the views
which insulate the rest of the IT and business community from the complexities of temporal data.

It's about time.

References:

1. R. T. Snodgrass. Developing Time-Oriented Database Applications in SQL. San Francisco:


Morgan-Kaufmann, 2000.
2. C. J. Date, Hugh Darwen and Nikos Lorentzos. Temporal Data and the Relational Model. San
Francisco: Morgan-Kaufmann, 2002.

For more information on related topics, visit the following channels:

Databases
Real-Time Enterprise

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

4 of 4 10/7/2008 9:23 AM
Time and Time Again: Managing Time in Relational Databases, Part 2 - Pr... http://www.dmreview.com/dmdirect/20070504/1082693-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 2 - Preliminaries
Tom Johnston, Randall Weis
DM Direct, May 2007

The needs which businesses have for versioned history are many and varied. However, those
needs and the data patterns which satisfy them seem to fall quite nicely along a linear progression
from simple to complex. We will call the patterns that satisfy each set of business needs "version
patterns" because, in the terms organized by the taxonomy in Part 1 of this series, we are
concerned with queryable history of the states of persistent objects, kept as versions of those
objects.

However, our concern will not be exclusively with past and present versions. The more general
topic is maintaining and querying tables which contain noncurrent as well as current versions of
their persistent objects. And there are other kinds of noncurrent versions than historical ones. We
will also consider versions for error correction, versions for future states, versions which overlap in
time, coordinated (pending/sandbox) updates and even versions for hypothetical states.

Version Patterns: Manner of Presentation

In this series of articles, we will discuss a dozen version patterns. Each pattern will be introduced
by describing the business requirements it supports.

Intuitive Data Reporting & Dashboards


Get instant visibility into your business with clickable access to your
data no matter where it resides. Customize reports & dashboards on
the fly.Try PivotLink now.

Next, we will show how these version patterns work by presenting several scenarios and timelines
for each pattern. Each scenario will show the results of doing an insert, update or delete against a
table constructed according to the pattern. Each timeline will show what would be returned by a
query against the past, present or future state of the object whose versions are being used as the
example.

Initially, these transactions will be simple ones that physically affect just one row in one table. But
gradually, more semantically expressive scenarios will be considered, ones in which what is seen
as one transaction to the issuing code or person will result in several rows being affected in
possibly several tables.

Following the scenarios for each history pattern, we will present and comment on a list of semantic
constraints that must be followed for the pattern to work. Most of these constraints can be
enforced only by triggers, stored procedures or application code. It is an indication of how
"special" and complex temporal state management is that the normal DBMS entity and referential
integrity mechanisms can do very little to enforce the semantics of these patterns.

1 of 6 10/7/2008 9:21 AM
Time and Time Again: Managing Time in Relational Databases, Part 2 - Pr... http://www.dmreview.com/dmdirect/20070504/1082693-1.html?type=pri...

Finally, each pattern will conclude with a list of objections to it that we have encountered. Some of
these objections are valid in the sense that if the different business requirements for the pattern
were weighted a little differently, a different pattern or at least a different physical implementation
than that suggested by the pattern would be appropriate. Other objections are not valid, either
because the pattern was not fully understood by those criticizing it, or because the flaws in the
alternative pattern that the critics preferred were not fully understood.

In this series of articles, it will not be possible to discuss all the objections to these patterns that
we are aware of. But we can, and will, at least note the existence of the objections we are aware
of.

Following is an annotated list of the version patterns which we will discuss in these articles.

Figure 1: Annotated List of Version Patterns

Scenarios

2 of 6 10/7/2008 9:21 AM
Time and Time Again: Managing Time in Relational Databases, Part 2 - Pr... http://www.dmreview.com/dmdirect/20070504/1082693-1.html?type=pri...

Figure 2 shows the template we will use to present scenarios.

Figure 2: The Scenario Template

The numbered boxes in the template will contain the following information:

1. Name of the scenario.


2. Table used for this scenario.
3. Definition of the table.
4. Description of the scenario.
5. Now (whatever date that may be).
6. Column headings for the table.
7. One or more rows of the table.

Basically, boxes 1 through 5 contain metadata that is necessary to understand the scenario.
Boxes 7 are the column headings for the table under discussion. There are one or more rows
indicated as boxes 8, and each one represents a row in a relational table.

Timelines

Except for the first two patterns, we will also use a timeline diagram to explain what is going on.
Figure 3 illustrates this diagram. It will be used beginning with Version Pattern 3, which is the first
pattern that goes beyond a simple update in place approach to history.

Figure 3: The Timeline Diagram

Health Care Insurance: A Business Example for Version Patterns

Health care insurance is not the only industry which has a need for real-time history. But the
insurance industry in general is certainly one in which real-time history is very important. Two
paradigmatic insurance situations make the point: processing a claim and answering a customer
telephone inquiry.

As for the first paradigmatic situation, a claim is filed for an event which happened at some point in
the past. The claimant has a policy with the insurance company, to which she sends the claim.

3 of 6 10/7/2008 9:21 AM
Time and Time Again: Managing Time in Relational Databases, Part 2 - Pr... http://www.dmreview.com/dmdirect/20070504/1082693-1.html?type=pri...

But the policy which is relevant to the claim is not her policy as it is when the claim is received.
Rather, it is her policy as it was when the claimable event occurred.

In property insurance, the claimable event is damage to property. In health care insurance, the
claimable event is a service event, an event in which a doctor, hospital or other health care
professional or organization provided medical services to the claimant.

Processing a claim involves comparing the claim to a past state of the policy against which the
claim is being made. That past state may happen to be identical to the current state of the policy.
But the point is that it equally well may not. And, in point of fact, it often is different. So processing
a claim requires real-time access to historical data.

There is a customer-facing aspect to real-time claims data, also. And because it is customer-
facing, it represents not just a need which a processing system has, but a product differentiator for
the insurance company. This is our second paradigmatic situation.

These events occur when a customer calls the insurance company and asks for an explanation of
benefits. Usually, what they are concerned with are benefits not provided, i.e., claims not paid or
claims that were reimbursed for less than the policyholder thought they should be.

The customer service representative (CSR) needs online, real-time history to answer the
customer's questions during that phone call, there and then. He needs complete information on
the caller's policy, as well as the date the service was provided for which the claim the caller is
concerned with was made. The need for real-time access to policy history exists because it is
important to answer the caller's questions while the caller is on the phone. Companies which
cannot do that are at a serious competitive disadvantage.

Other industries will have their own paradigmatic situations illustrating their own needs for
real-time access to history. But claims processing is the operational heart of insurance, and
customer satisfaction is the operational soul of any business.

So: historical data is already being managed in insurance company databases. Our intent, in this
series of articles, is to present what we believe is a better way of satisfying this need than the
various ad hoc solutions we have seen over the past few decades. And better has the bottom-line
meaning it always has had: better in terms of revenue enhancement, cost containment and
customer service.

Two additional points will serve to complete our introduction to this topic. The first point illustrates
the increasing importance of versioned history by considering the breakdown of a traditional
alignment of current versus historical with operational versus analytic. The second point is that
dates and timestamps (of varying granularity) are just different ways of representing the tick of a
logical clock that provides reference times for all versioned history.

Operational Data and Analytical Data

Another way of realizing the significance of this topic is to understand it in terms of the breakdown
of a traditional alignment. Traditionally, the distinction between operational data and analytical
data has lined up with the distinction between current data and historical data. In talking about the
increasing need for real-time access to historical data, we are emphasizing that this alignment is
breaking down.

On the one hand, data marts have traditionally been considered part of the analytical environment,

4 of 6 10/7/2008 9:21 AM
Time and Time Again: Managing Time in Relational Databases, Part 2 - Pr... http://www.dmreview.com/dmdirect/20070504/1082693-1.html?type=pri...

not part of the operational environment. Data marts are and always have been a means of
discovering trending over time patterns. But near real-time data is increasingly being required in
data marts. The reason is that businesses are finding that they can make better analytical
decisions if the relevant data reaches right up to the present moment.

This is a particularly acute need for large-scale inventory management systems of consumer
goods, especially during the holiday season. The analytics needed from such systems do not
directly concern on-hand, on-order, sales and returns volumes. Rather, the single critical concern
is with velocities. The critical question isn't, "How much did we sell last week?" or even "How much
do we have on-hand and on-order right now?" The critical question is, "Is our on-hand and our
pipeline adequate to keep product in stock, given the rate at which it is being sold?" Since sales
volumes for much consumer merchandise peaks during the holiday season, the velocity that
decision-makers are concerned with, at that time of year, isn't one based on rates of change
reaching back six months. It's one based on rates of change reaching back six days! So this is a
situation in which current data is needed for analytical decision-making.

If you are thinking, at this point, that we have just described a need for operational data, not for
analytical data, then you are illustrating the confusion caused when an alignment (operational =
current, analytical = historical) breaks down. What we have just described is a need for near
real-time data. If we put the currency of data aside, however, what distinguishes operational from
analytical data?

One answer might be that analytical data is used by upper-level management, while operational
data is used by lower-level management. While this happens to be true, it isn't very informative.
What we need to know is what upper-level management is doing with its data that is different from
what lower-level management is doing with its data.

The answer is that analytical data attempts to predict the future, while operational data attempts to
optimize the present. Analytical data uses the past to establish curves, one of whose axes is time.
The more accurate those curves are, the better an indicator they will be of future trends. For the
best basis on which to guess what the future will be like is a sufficiently long and accurate timeline
of what has happened in the past. Just project the curve into the future.

On the one hand, as our inventory management example illustrated, real-time data is needed for
analytical decision-making. Conversely, operational data, whether in OLTP systems or in
operational data stores (ODSs), increasingly includes relevant historical data, and operational
decision-making increasingly requires access to that data that is just as real-time as is access to
current data.

Again, in the insurance industry, the past state of insurance policies is the relevant example. The
claim is current data. The relevant state of the policy is historical data. So this is a situation in
which historical data is needed for operational decision-making. The old "operational = current,
analytical = historical" alignment is breaking down. Historical data is no longer "second-class"
data.

Every Tick of the Clock

Throughout these articles, we will be talking about dates, not timestamps. This is for the sake of
simplicity; nothing in these patterns would be disturbed if we were to talk about timestamps
instead. However, this simplification means that the granularity at which time will appear in these
examples is the granularity of a single day. For example, we will talk about a version ending one

5 of 6 10/7/2008 9:21 AM
Time and Time Again: Managing Time in Relational Databases, Part 2 - Pr... http://www.dmreview.com/dmdirect/20070504/1082693-1.html?type=pri...

day prior to the start of the next version.

More generally, a version ends one tick of the clock before the start of the next version (assuming,
as we do for now, that there can be no gaps between versions of the same thing being versioned).
That is one tick of whatever "clock" is being used. With dates, every clock tick is one day.
However, our discussions apply equally to situations in which timestamps are used instead of
dates. With timestamps, the clock ticks once for every change in timestamp value, at the
granularity at which the timestamps are recorded. Thus, for example, timestamps can record a
clock that ticks in hours, minutes, seconds, milliseconds, etc.

Using a clock for which each tick is one day isn't at all unrealistic. It supports a business
requirement, for example, to load a data warehouse or ODS once a day, with whatever the source
data is like at that point in time. For many warehousing situations, a clock that ticks once a day is
fine-grained enough to meet all business requirements for the warehouse.

Note: We will not include century in the dates used in these examples. Instead, we will specify
them as "mm/dd/yy." This is to keep the columns as narrow as possible, to fit entire rows across a
page.

However, there is one "special date" value that will require a century. It's the value "12/31/9999."
Obviously, "12/31/99" cannot be reserved for a special meaning, since it is a date that is likely to
appear as real business data. But "12/31/9999" is available to carry a special meaning, on the
assumption that no business data will occur that needs to specify that exact day nearly eight-
thousand years from now. That special meaning has some of the semantics of "end date
unknown," and also some of the semantics of "end of time." But we think that the most accurately
descriptive name for this value is "until next changed or deleted." These semantics will be
examined more thoroughly in a later installment in this series. The differences are subtle; and
sometimes, in the real world of business IT, they are even relevant.

With this special meaning, "12/31/9999" is not a date; that's why we refer to it as a date value, i.e.,
a value that appears in a date data type but which does not carry the semantics of a date. (If this
appears awkward to you, it is. We need the SQL standards committees, and the vendors, to
provide something better for us. But until they do, this is what we have to work with.)

Part 3 begins the discussion of specific version patterns.

For more information on related topics, visit the following channels:

Best Practices/Benchmarking
Business Intelligence (BI)
Customer Relationship Management (CRM)
Data Marts
Operational Data Store

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

6 of 6 10/7/2008 9:21 AM
Time and Time Again: Managing Time in Relational Databases, Part 3 - V... http://www.dmreview.com/dmdirect/20070518/1084145-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 3 - Version
Patterns 1 and 2
Tom Johnston, Randall Weis
DM Direct, May 2007

Version Pattern 1: Updating in Place

Updates in place destroy history by overwriting the previous version of the rows they update. But
in most cases, this is the way tables are in fact updated. In these cases, the business does not
require that the earlier states of the updated rows be available as queryable history. For this kind
of data and these business requirements, the need for history is so infrequent and so nonreal-time
critical, that reconstructable history is good enough.

However, even with tables that lack queryable history, it is quite common for developers to include
both a create date and a last update date; these two dates provide a minimal level of support for
history. Let's include them as business requirements for Version Pattern 1 and note them as
follows:

{BR 1}. Retain the date each row was created, and also the date of its last update.

Join the Largest Community of MDM Experts in New York City,


October 19-21, 2008
Restructured to better suit your business needs, MDM Summit Fall 2008
offers value of experience and thought leadership. Speakers will discuss
hierarchy management, identity resolution and more. Pre-register by
August 8 for bonus savings!

The scenarios below illustrate Version Pattern 1. Our example will be an insurance policy table.

Scenario 1.1: an Insert to the Policy Table

On 1/12/04, a policy for client Mary Jaspers is inserted into the Policy table for the Acme
Healthcare Insurance Company (AHIC). For now, we will assume that this is also the action that
makes the policy effective. In other words, a policy with AHIC becomes effective on the date that a
row representing that policy is inserted into AHIC's Policy table.

Note: although 1/12/04 is in the past, as of the date of publication of this article, the paragraph
above said that on "1/12/04, a policy for client Mary Jaspers is inserted into the Policy table"
(italics added). It did not say ".... A policy ...... was inserted .......". This "is" should be consider a
tenseless "is," not an indication of the present tense. The reason to speak tenselessly whenever
possible is that otherwise, the tenses involved in an English language description of these
situations simply become too complex. With a tenseless natural language description, we will not
be distracted from the focus of these articles, which is the management of time in relational
databases, not the management of time and tenses in natural languages.

1 of 7 10/7/2008 9:24 AM
Time and Time Again: Managing Time in Relational Databases, Part 3 - V... http://www.dmreview.com/dmdirect/20070518/1084145-1.html?type=pri...

In later history patterns, we will drop this assumption, and distinguish system activity dates such
as dates rows were inserted, updated or deleted, from business effectivity dates such as begin
and end dates of business effectivity, or the effectivity dates for a correction to an earlier entry.
This distinction between system activity and business effectivity dates is referred to, in the
literature, as the "bi-temporal" distinction.

Note: those in the health care insurance industry will recognize that this example about policies is
not a typical situation. Typically, health care insurance policies are group policies, which are
contracts between a client company and the insurance company, not a contract between the
insured individuals and the insurance company.

In the property and casualty industry, on the other hand, policies which cover an individual or
family are usually contracts between the insurance company and that individual or family. But
health care insurance companies do issue individual policies, and we believe that this will provide
a simpler set of examples with which to illustrate our proposal for managing versioned history.

Scenario 1.1 shows the result of entering Mary's policy into the Policy table. (See the previous
article - Part 2 - for an explanation of the scenario template used here and throughout the
remaining articles in this series.)

Note: the conventions for distinguishing different types of columns in these illustrative tables are
as follows:

Primary key columns are listed left-most, and the column headings are bold-faced. They
contain the suffix (PK).
Foreign key columns are listed next, and the column headings are underlined. They contain
the suffix (FK).

Note: these articles do not contain examples of foreign keys which are also primary keys. One of
us (Johnston) has written several articles explaining why he thinks that foreign keys should never
occur in primary keys. See References for links to these articles.

Non-primary key and non-foreign key business data columns are listed next.
Metadata columns, such as crt-dt and last-updt-dt, are listed last.

The client-nbr foreign key can be ignored for now. It will be used in later scenarios. The same is
true for [now], whose template cell contains "n/a" (for "not applicable"). [Now] (the brackets
indicate that we are talking about a metadata item, and not merely referring to the present
moment) will not become relevant until we move on to more complex scenarios.

2 of 7 10/7/2008 9:24 AM
Time and Time Again: Managing Time in Relational Databases, Part 3 - V... http://www.dmreview.com/dmdirect/20070518/1084145-1.html?type=pri...

The policy type is PPO (preferred provider option). The first metadata column (crt-dt) tells us that
the row was inserted on 1/12/04. The {null} in the second metadata column (last-updt-dt) tells us
that, so far, this row has not been updated.

Presumably there are many other rows and columns in the Policy table. But we won't show any of
them until later on, when they are needed to illustrate discussions of more complex patterns.

Scenario 1.2: an Update to the Policy Table

On 3/14/05, Mary changes her policy type from PPO to HMO (Health Maintenance Organization).
After that change is recorded in the Policy table, Mary's Policy row looks like this:

The use of a create date and last update date is common in business IT. And with those two
dates, we are not entirely lacking in temporal metadata about this row. Crt-dt tells us that the row
was inserted on 1/12/04, last-updt-dt that it was last changed on 3/14/05.

But there is, nonetheless, a lot that we do not know. We do not know how many times this row
was changed prior to 3/14/05 (if any), nor what column or columns were changed by any earlier
updates. We don't even know what column or columns were changed on 3/14/05. And even if we
did know that it was the policy type column that was changed, we still wouldn't know what the
previous value was.

So, information has been lost. More precisely, using the taxonomy introduced in Part 1,
information in the form of queryable history has been lost. Reconstructable history is always
available, of course. In this case, to get Policy P138's reconstructable history, we would scan the
logfile of the initial inserts and all subsequent updates to the Policy table, starting on 1/12/04, and
arrange the logfile entries for P138 in chronological order.

Scenario 1.3: a Physical Delete to the Policy Table

On 6/1/06, Mary's policy is ended (by contract expiration, explicit termination, revocation, etc. - it
doesn't matter). This is recorded, using Version Pattern 1, by physically deleting her policy's row in
the Policy table. After that change is recorded in the Policy table, the table looks like this:

3 of 7 10/7/2008 9:24 AM
Time and Time Again: Managing Time in Relational Databases, Part 3 - V... http://www.dmreview.com/dmdirect/20070518/1084145-1.html?type=pri...

After the update shown in Scenario 1.2, as we said above, we have both the current state of
Mary's policy, and also a small amount of metadata. Now we have neither. We can't even tell that
Mary ever had a policy with AHIC.

Sometimes losing this much information to queryable access doesn't matter. Sometimes all we
need to know in real-time is what things are like right now. But frequently, businesses do require a
minimal amount of real-time available history, such as, in this case, information about policies no
longer in effect.

This requirement can be met by replacing a physical delete with a logical delete. This too is a
technique well-known to business IT professionals.

Eventually, logically deleted rows will be archived off the online table. This is usually done based
on their age. For example, the business rule might be "at each month-end, archive all logically
deleted rows that were deleted more than forty-nine months ago." In terms of the taxonomy
introduced in Part 1, this is a rule which states the conditions under which a row changes from
queryable to reconstructable.

But for now, we can ignore the archive process. What we want to do is see how a logical delete is
typically implemented. This is shown in Version Pattern 2.

Version Pattern 2: Updating in Place with Logical Deletion

Because physical deletes destroy even the little historical information that Version Pattern 1
provides, businesses frequently ask their IT departments for some way to preserve that minimal
level of historical information. Usually, the way this is done is to do a logical delete instead of a
physical delete.

{BR 2}. Support {BR 1}, and also retain deleted rows as queryable history.

There are variations on this theme, of course, as there are on almost any data pattern. We will
explain below why we chose the variation we did.

Scenario 2.1: A Logical Delete to the Policy Table

To support logical deletes, we need an additional column - a delete indicator. Often this is
implemented as a delete flag, set to "Y" if the row is logically deleted, and to "N" otherwise. The
date of the deletion is then put in last-updt-dt.

Another option is to have a separate delete date column. With this option, a delete flag isn't
needed, because logically deleted rows are all and only those rows whose delete dates are not
null.

We prefer this second option because it preserves rather than overwrites the metadata information
about the last update date (and also because we have a further use for this new column, which
we will discuss later on).

On 6/12/06, Mary Franklin's policy ends. Starting on that date, it is no longer in effect. This time,
we record that event by logically deleting her policy's row in the Policy table. After that change is
recorded, the table looks like this:

4 of 7 10/7/2008 9:24 AM
Time and Time Again: Managing Time in Relational Databases, Part 3 - V... http://www.dmreview.com/dmdirect/20070518/1084145-1.html?type=pri...

Note: to make it possible to show an entire row on one line, we will start doubling up. The two
metadata dates, crt-dt and last-updt-dt, will be stacked from this point forward. Later, other such
doubling-ups may be required. Keep in mind that in spite of the graphical layout, crt-dt and
last-updt-dt are still two distinct columns in this table.

The Semantics of Version Pattern 2

For Version Pattern 2 to satisfy its business requirements, the following constraints must be
enforced when updating the table:

{SC 2-1}. Always, del-dt must be null if the row is not logically deleted, and must otherwise
contain a valid date.

{SC 2-2}. When entered, del-dt must be the current date, or else a future date. It cannot be a
past date.

{SC 2-3}. When entered, foreign key semantics must be enforced.

{SC 2-4}. Once entered, del-dt cannot be changed. If del-dt is not null, the date in it must be
the date originally put in it.

{SC 2-2} is particularly interesting. What is the reasoning behind it? Why can't we use a past date
when logically deleting a row? After all, if we were supposed to delete a row on a particular date,
and simply forgot to do it, why can't we just apply the delete retroactively? Wouldn't that produce
the same result as if we had done the delete on its originally scheduled date?

The answer is "no." If a past date is used, we are changing the past, and doing so without leaving
any indication that we did. In other words, if a past date is used, we were either not telling the
truth before the logical delete, or we are not telling the truth after the logical delete. Here's why.

Imagine we were supposed to delete the row seven days ago and that we logically delete the row,
today, by setting its del-dt to the date seven days ago. Whether we did this seven days ago, or do
it today, the result will be the same. From today forward, the database will show a logical delete
on 6/12/06.

However, for the seven days prior to this action, the row was not deleted during those seven days,
and so during that period of time, our database says that the policy is active. But as soon as we
put a past date into del-dt, our database now says that the row was logically deleted during those
seven days, and therefore that the policy was not active during that period of time. Well, either it
was active over the past seven days, or it wasn't. We can't have it both ways.

5 of 7 10/7/2008 9:24 AM
Time and Time Again: Managing Time in Relational Databases, Part 3 - V... http://www.dmreview.com/dmdirect/20070518/1084145-1.html?type=pri...

With this retroactive delete, we have changed the past, and left no indication that we have done
so. A query counting active policies in the period 6/6/06 - 6/12/06, run sometime during the past
week, would include Mary's policy in that count. The identical query, run anytime after that event,
would not include her policy.

Even so, the assumption is that we were supposed to delete that row some time in the past. The
reason is that, just as a policy becomes effective on the date a row for that policy is inserted into
the Policy table, a policy ceases to be effective on the date a row for that policy is deleted from the
Policy table. We missed the date to delete the policy, and now we apparently have no way to
correct the mistake and show the date on which Mary's policy became no longer in effect.

Our dilemma is this: as long as the date of a physical insert into a table is also a business
effectivity start date, and the date of a physical delete is also a business effectivity end date, we
better be sure we can insert and delete on time! For if we miss either an insert or a delete date,
the only way we can correct the mistake is to lie about the past!

The dilemma is resolved, not by lying about the past, but rather by eliminating the homonyms that
are the source of the dilemma. Crt-dt is a homonym because it means both "the date the row is
inserted" and also "the date that the object represented by the row becomes effective." Del-dt is a
homonym for analogous reasons.

As we said at the beginning of this article, distinguishing database activity dates from business
effectivity dates creates what computer scientists call a "bi-temporal" pattern. We will have more to
say about bi-temporality in later installments, starting with the discussion of Version Pattern 5.

As for {SC2-3}, it simply says that semantically, a logical deletion must follow the same referential
integrity rules as does a physical deletion. So, for example, if there are any foreign key
dependencies on P138, then either the logical delete of P138 must be blocked, or else it must
trigger a cascade (logical) delete to all rows that refer to it.

As for {SC 2-4}, the reason we cannot change del-dt once it has become non-null is the same.
Doing so would change the past. Our database would give different answers, depending on when
it was queried. At least one of those answers would have to be false.

This concludes our discussion of Version Patterns 1 and 2. The next installment will discuss
Version Pattern 3.

References:

1. Tom Johnston. "Primary Key Reengineering Projects: the Problem."DM Review. February
2000.
2. Tom Johnston. "Primary Key Reengineering Projects: the Solution."DMReview.com. March 1,
2000.
3. Tom Johnston. "The Semantics of Surrogate Keys and Business Keys: Part
1."Datawarehouse.com. December 6, 2002.
4. Tom Johnston. "The Semantics of Surrogate Keys and Business Keys: Part 2."
Datawarehouse.com. January 13, 2003.

For more information on related topics, visit the following channels:

Data Management
Data Modeling

6 of 7 10/7/2008 9:24 AM
Time and Time Again: Managing Time in Relational Databases, Part 3 - V... http://www.dmreview.com/dmdirect/20070518/1084145-1.html?type=pri...

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

7 of 7 10/7/2008 9:24 AM
Time and Time Again: Managing Time in Relational Databases, Part 4 - Pr... http://www.dmreview.com/dmdirect/20070601/1085255-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 4 - Preliminaries to
Version Pattern 3
Tom Johnston, Randall Weis
DM Direct, June 2007

The first two version patterns, described in Part 3, support a minimal form of queryable history in
business databases. But sometimes that isn't enough. For even if we use Version Pattern 2 -
updating in place with logical deletion - there are still two important ways in which history can be
lost.

First, as soon as the first update takes place, the row, as originally entered, is lost because the
update overwrites the row. So we lose the information that was true of the policy on the date it
became effective. Next, considering the entire row as a single unit of data, all updates but the
most recent one are also lost, including even the knowledge of whether or not there were any
previous updates. Sometimes this is acceptable; but sometimes it is not.

Note: this may suggest that we should manage versions at the level of individual row/columns, not
at the level of entire rows. Although that would certainly be possible, we believe that the same
semantics can be supported by versioning entire rows, and at a lower overall system cost. We will
not consider row/column level versioning in the remainder of these articles.

Join the Largest Community of MDM Experts in New York City,


October 19-21, 2008
Restructured to better suit your business needs, MDM Summit Fall 2008
offers value of experience and thought leadership. Speakers will discuss
hierarchy management, identity resolution and more. Pre-register by
August 8 for bonus savings!

Now we will consider a more stringent business requirement for history.

{BR-3}. Support {BR-2}, and also keep a complete queryable history on policies, i.e., on all
inserts, updates and deletes to all policies.

As our history taxonomy indicates, our first choice is to keep this history as a chronological record
of either events or states. We could keep an event-based history of changes to policies by adding
a row to the Policy table each time a new policy was created, and after that keeping a transaction
table in which each transaction was an update or delete to the policy.

Event-based history is most appropriately used to manage changes to metric values of


relationships among persistent objects, values such as counts, quantities and amounts. These
relationships that contain metrics are often called balances. Each change to a relationship metric
is recorded as a transaction, and in the traditional star schema design, each transaction contains a
date, the change to each metric, and a foreign key to each object that is affected by the
transaction, objects such as customers, products, suppliers and so on.

1 of 6 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 4 - Pr... http://www.dmreview.com/dmdirect/20070601/1085255-1.html?type=pri...

However, event-based history is not appropriate for managing changes to non-metric values of
persistent objects. For them, state-based history is the preferred option. Because we have chosen
to keep a state-based history of policies, we must now decide whether to do so using snapshots or
versions. Snapshots are the preferred method for recording the coordinated history of many rows,
usually across many tables, as of a single point in time. However, if the same column in the same
row is updated more than once between snapshots, only the last update will be captured.
Therefore, since the business requirement is to track all changes to policies, we must do so using
versions.

What Version Tables Version

Before we dive into the details of managing versioned history, we must first understand what is
being logically inserted, updated and deleted in versioned tables. It is not versions. It is the objects
they are versions of. The physical insertion of a version logically inserts, updates or deletes an
object, and does so in the following way.

The physical insert of the first version for a policy is a logical insert of that object, i.e., that
policy.

The physical insert of subsequent versions for that policy, which are not designated as a
logical delete of the policy, are logical updates of the policy.

The physical insert of a version marked as a logical delete logically deletes the policy.

Note that "object" is being used here in the sense of "anything that persists over time," not in its
narrower object-oriented sense. Customers, products, suppliers, contracts, employees, facilities,
inventories, packages, invoices, purchase orders - all these are objects. For the most part, we are
using insurance policies as illustrative objects in these articles.

A Policy Version Table

Each row in the original Policy table represents a policy (of course). But no row in the Policy
Version table represents a policy! Instead, each row represents what we know about what a policy
looked like during a designated period of time.

Multiple rows on the Policy Version table may have the same policy-nbr, of course. If this were a
Policy table, that would be an error. But on a Policy Version table, it is not an error. On a Policy
Version table, those rows, and only those rows, with a given policy number are versions of that
policy. Those rows, and only those rows, contain information about that policy during a period of
time.

So a version is a timeslice of an object. It is not the object itself. A complete sequenced set of
versions is a full timeline for an object.

Since none of these rows in the Policy Version table represent policies, where are the rows that
do? If this table is not a table of policies, where is that table?

The answer is that, on our implementation of versioned history, there is no Policy table. The
reason is that we don't need one. For any point in time during which a policy was a policy, we can
retrieve a row from the Policy Version table that represents the policy as of that point in time. The
first row in the Policy Version table for a policy shows what the policy was like when it became

2 of 6 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 4 - Pr... http://www.dmreview.com/dmdirect/20070601/1085255-1.html?type=pri...

effective. Subsequent rows show what the policy was like after each of the changes that
happened to it and until the next one happened. If the Policy terminated, there is a version for that
event also.

A Simplifying Assumption: One and Only One Episode per Object

Before proceeding, we will introduce two simplifying assumptions into our initial discussion of
versions, assumptions which we will drop later on. The first one is this: objects can't recur.

For example, if Mary Jaspers' policy P138 terminated on a given date, but the next day Mary called
and said she wanted to continue her policy, then given this assumption, AHIC would have to issue
her a new policy. If Mary wanted no changes to her policy, the new policy would be identical to the
old one except for its policy ID number, but it would in fact be a new, distinct policy.

If an object could recur, what would that look like, as far as data is concerned? In the case of
Mary's policy, it would look like a logical delete of policy P138 followed, after one or more clock
ticks, by another version for P138, which is not a logical delete. Normally, persistent objects
re-appear after some period of time, i.e., after some possibly large number of clock ticks. So there
can be gaps between episodes of the same object, although if there is only one clock tick between
a logical delete and the next version of an object, those episodes would not have a gap between
them.

This is illustrated in Figure 1.

Figure 1: Episodes of an Object

Since that next version is an initial version of the policy after a prior logical delete of that policy, it
constitutes a reappearance of the policy. Let us call the first version for P138 (the Jan 1, 2005
version) the original initial version, and each other version which immediately follows a logical
delete of P138 (the Jul 15, 2005 version) a successor initial version. The count of initial and
successor versions for an object is a count of what we will call episodes of that object. Gaps may
or may not exist between successive episodes. In this case, the gap extends from one clock tick
after the first episode is terminated, to the clock tick on which the second episode begins.

If the number of logical deletes of an object is the same as the number of episodes, then, for as
long as that is true, the object does not currently exist on our database. That is because every
episode, including the most recent one, has been terminated by a logical delete.

The only other case that can arise is that the number of episodes is one greater than the number
of logical deletes. That is because, in a series of versions for the same object, what splits them
into episodes are the logical delete versions. In this case, the object does currently exist on our
database because there is no logical delete version for the most recent (or only) episode.

So our simplifying assumption can also be expressed like this: (for now), each persistent object
has one and only one episode. That episode, of course, can consist of any number of versions.

This single-episode constraint means that AHIC (our fictional insurance company) does not permit
policies to recur. So if a policy lapsed for even a single day, and the policy holder wanted her
policy reinstated immediately, she would be assigned a new policy. The initial version of that new
policy would become effective one clock tick after the deletion of her previous policy. This is not a
realistic way of doing business, of course, and later on we will consider scenarios in which a
lapsed policy may be reinstated.

3 of 6 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 4 - Pr... http://www.dmreview.com/dmdirect/20070601/1085255-1.html?type=pri...

Another Simplifying Assumption: No Gaps Within Episodes

Our second assumption is that there can be no gaps, within a given episode of an object, between
successive versions. One version must follow right after the other. That is, all noninitial versions of
an episode of an object must begin on the next tick of the clock after the version they supercede
ends. So in Figure 1, each arrow which is labeled "version {n} ends; version {n+1} starts" actually
points to two adjacent clock ticks.

A corollary is that each nonterminal version of an episode of a persistent object must end on the
tick of the clock immediately before the tick on which the version they precede begins. (See Part 2
for a discussion of clock ticks.)

Let us call a gap within an episode an intra-episodic gap, using the term "episode" as indicated
above, to refer to the time span for an object that reaches from the begin date of its initial version
to the end date of the first logical delete reached by following that object's versions in
chronological sequence. And let us call the first assumption, that objects have only a single
episode, the requirement that there be no inter-episodic gaps.

Note: Bear in mind that while we reference an event's commencement and conclusion as begin
date and end date, these temporal markers would frequently be a more granular date/timestamp.
As always, the key concept here is a "tick of the clock," at whatever level of granularity is
appropriate for your specific application.

Semantic Constraints

Our two simplifying assumptions come to this: until further notice, these articles are discussing
single-episode objects whose versions have no gaps. The semantic constraints imposed by these
assumptions are as follows:

{SC 3-1}. If a version for an object is followed by another version for that same object, the
end date of the former version must be one tick of the clock prior to the begin date for the
latter version.

{SC 3-2}. If a version for an object is not followed by another version for that same object,
and is not itself a logical delete version, the end date will be assumed to be unspecified.
(This is not always the case, of course; and in later patterns, we will drop this simplifying
assumption.)

{SC 3-3}. If a version for an object is a delete version, the object will be considered deleted
as of that delete date.

{SC 3-4}. It is semantically illegitimate for any version of an object to follow (to be
chronologically later than, based on its version date) a logical delete version of that object.

There are several issues raised by these semantic constraints, all of which we will discuss later on.
They include:

The semantics of "unspecified" end dates. For some versions which do have an end date,
that end date may be unknown at the time the version is entered. For others, which we can
think of as "until further notice" versions, there is no end date, known or not, at the time they
are entered. In the literature, an unspecified end date is sometimes called "forever." But

4 of 6 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 4 - Pr... http://www.dmreview.com/dmdirect/20070601/1085255-1.html?type=pri...

forever implies that we know the version will never terminate, and there aren't many objects
which have versions like that. And one may hope that if an end date for a version should be
known, it is known at the time the version is entered. If we can assume that, it leaves the
"until further notice" kind of version as the only kind of version with an unknown end date.
But its semantics is not the same as the semantics of forever. This issue will be discussed in
later installments in this series.

The existence of versions with a known end date. We must be able to manage versions of
objects with a known end date, and also versions of objects whose end date is not known.
As we will see later, this can prove quite challenging.

The fact, as already noted, that businesses often need to recognize the re-appearance of the
same object after a gap in time. This is the need to permit multiple episodes of an object on
the database. It is a common business requirement; customers come and go, products are
offered for sale, later retired, and later still reintroduced, etc.

The fact that businesses often need to keep two or more versions with overlapping or
identical effectivity periods.

This last point is especially important. In normal cases, versions cannot temporally overlap one
another. If they did, there would be a period of time during which both were effective. But if they
were both effective, how would we know which one was the truth? And what would the other one
represent, if not the truth? These are topics for later articles in this series.

Corrections, Gaps and Overlaps

To avoid intra-episodic gaps, we must write code to insure that every new noninitial version begins
exactly one tick of the clock after the version it is about to supercede. However, while necessary,
this is not sufficient to guarantee that there will be no intra-episodic gaps. A second way in which
such a gap can be introduced is with a correction.

Consider a non-initial version entered with an effective begin date that is earlier than it should be.
In that case, we need to move its begin date forward. But doing so would introduce a gap between
it and its precedessor version. By the same token, moving an effective end date back would
introduce a gap between it and its successor version. To prevent such corrections from violating
{SC 3-1}, we must always correct two versions when an effective begin date on a non-initial version
is moved forward or when an effective end date on a non-terminal version is moved backward.
One is the version itself that is being corrected. The other is the adjacent version, which must now
be corrected to "fill in the gap" created by the first correction.

We should not think of these two physical updates as two transactions. There is one transaction; it
is the movement of the transition point in time between two adjacent versions. This single semantic
update requires two physical rows to be updated.

Semantic constraint {SC 3-1} also requires that updating an effectivity date range must not create
an overlap. One caveat, of course, is that a correction which does not alter a version's effectivity
date range must have exactly the same date range as the version it corrects.

Version Patterns 1 and 2 were both "update in place" patterns. In this installment, we have laid the
groundwork for discussions of patterns which do not update in place, i.e., which do not overlay
data already on the database. First, we introduced two simplifying assumptions, which we will
eventually drop as we discuss increasingly complex patterns. Next, we introduced some concepts

5 of 6 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 4 - Pr... http://www.dmreview.com/dmdirect/20070601/1085255-1.html?type=pri...

which will prove useful in those later discussions. The concepts were:

Timeslice
Timeline
Original initial version
Successor initial version
Episode
Gaps
Intra-episodic gaps
Inter-episodic gaps

In Part 5, we will begin our discussion of Version Pattern 3.

For more information on related topics, visit the following channels:

Data Management

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

6 of 6 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 5: Ve... http://www.dmreview.com/dmdirect/20070615/1086123-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 5: Version Pattern
3
Tom Johnston, Randall Weis
DM Direct, June 2007

The business requirement for Version Pattern 3 is to be able to show what an object was like at
any point in its lifetime. To do this, we must retain the before-update state of an object, not just the
after-update state. We must retain a versioned history of changes to the state of an object. This
means that in place of the Policy table we have been using thus far, we will need a Policy Version
table.

{BR-3}. Support {BR-2}, and also retain a queryable history of changes to objects.

Tables and Keys

The primary key of our original Policy table is policy-nbr. So the primary key of our Policy Version
table must be policy-nbr plus something else. The purpose of that something else is to distinguish
each version of a policy from all other versions of that same policy.

Intuitive Data Reporting & Dashboards


Get instant visibility into your business with clickable access to your
data no matter where it resides. Customize reports & dashboards on
the fly.Try PivotLink now.

Pretty much anything will do to distinguish policy versions. In fact, any value guaranteed not to
recur for the same policy will do. The first thing that comes to mind is a sequence number. The
initial version is 1, and the subsequent n versions are 2, 3, ...... , n.

This isn't a bad idea, just as the logical delete flag, discussed in Version Pattern 2, was not a bad
idea. But just as we preferred a delete date to a delete flag, we prefer to use a date as our version-
discriminating part of the primary key for policy versions, and thus the combination of policy-nbr
and this date to uniquely identify each row in the Policy Version table.

Business Dates and System Dates

Version Pattern 3 does not distinguish business effectivity dates from physical database activity
dates. We must have a business effectivity date, of course. But for the purposes of this analysis,
we can consider this date as either the business effectivity start date for the version, or that date
and also the date the version was physically inserted. In the former case, we simply fail to record
the date of physical database activity. In the latter case, the two dates are identical, with
complications that we will examine later.

1 of 8 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 5: Ve... http://www.dmreview.com/dmdirect/20070615/1086123-1.html?type=pri...

The former case is simpler, but it is also almost never seen in production databases. Physical
insertion dates (and physical last update dates, in the case of updates in place) are almost always
included. So for Version Pattern 3, we will assume that version begin date is both the date the
inserted row became business effective and also the date the row was physically inserted.

Let's now recreate the series of events described in Scenario 1. We begin with the event that
created a new policy for Mary Jaspers. This event takes place on 1/12/04.

Scenario 3.1: A Policy Insert in the Policy Version Table

As we have already pointed out, the physical action of inserting the initial version for a policy is
also the logical action of inserting a new policy. Here is the result of inserting Mary's policy
effective 1/12/04:

Figure 1: Result of Inserting Mary Jaspers's Policy Effective 1/12/04

The primary key of the policy version is policy-nbr (the unique identifier for a policy) plus version
begin date. All versions for the same policy have the same policy number and are distinguished
from one another by their begin date. Because Version Pattern 3 is based on the assumption that
objects cannot recur (as described in Part 4), there will be no temporal gaps among versions for
the same object. Consequently, we also know the end date for all versions but the current one. It
is one day (one clock tick) prior to the begin date for the chronologically next version.

If the above row were inserted on any date later than 1/12/04, it would be a retroactive insert. If it
were inserted on any date earlier than 1/12/04, it would be a proactive insert. But as long as rows
must be inserted on their business effectivity dates, then neither retroactive nor proactive inserts
are possible. In Version Pattern 4, we will distinguish these two kinds of dates and consider
retroactive and proactive database activity.

Using this primary key presupposes that we can never have two versions of the same policy that
begin on the same date, i.e., on the same tick of whatever clock we are using. For now, we'll stick
with this assumption. Later on, we will examine an exception to it.

Figure 2: Timeline Diagram for Scenario 3.1

2 of 8 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 5: Ve... http://www.dmreview.com/dmdirect/20070615/1086123-1.html?type=pri...

By inserting this initial version, we have created policy P138, effective 1/12/2004. Now let's see
how the first update to this policy is represented in the Policy Version table.

Scenario 3.2: A Policy Update in the Policy Version Table

On 3/14/05, Mary changes her policy type from PPO to HMO (Health Maintenance Organization).
After that change is recorded in the Policy Version table, the table looks like this:

Figure 3: Policy Version Table After Policy Type Change

Because of our no-gap assumption, we know from this data that this policy was a PPO policy from
1/12/04 through 3/13/05 - the latter date being one clock tick prior to the begin date of the next
version.

Note: One way of representing date ranges is by a pair of dates (one of them inferred, in this
example). Another is to specify a begin date and a duration, i.e. a number of clock ticks. These
two ways of representing date ranges are semantically equivalent; each can handle all (and only)
the business requirements that the other can. They differ only in details of implementation.
Throughout these articles, we have chosen to represent date ranges with a pair of dates.

When a pair of dates is used, we must specify whether neither, either or both are included in the
range they specify. Again, the options are semantically equivalent. In these articles, we have
chosen to include begin dates in the range, but exclude end dates.

The timeline for policy P138, after this update, is shown below.

Figure 4: Timeline Diagram for Scenario 3.2

Note: graphics for timeline diagrams permit us to draw the lines indicating the start of a version
only in alignment with the start of a vertical bar. Vertical bars represent months. So our convention
will be to draw these version start lines at the start of the month of their ver-beg-dt. Thus, a
ver-beg-dt of 3/14/05 is drawn aligned with the start of the vertical bar representing March, 2005.

After this change, we do have an end date for our initial version, although it is inferred rather than

3 of 8 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 5: Ve... http://www.dmreview.com/dmdirect/20070615/1086123-1.html?type=pri...

expressed as an explicit data value. However, we still do not have an end date for Mary's policy
itself, as is graphically illustrated by the ellipsis on the above timeline.

Scenario 3.3: A Policy Deletion in the Policy Version Table

On 6/12/06, Mary's policy is terminated. But in the Policy Version table, as shown in Scenario 3.1,
we have no way to distinguish a version representing a policy deletion from one representing a
policy insert.

To distinguish versions which logically delete a policy, we could add a delete flag. But a delete
date carries more information than a delete flag. So while we do not need a second date to
manage policy inserts and updates, we do need one to manage policy deletions.

Notice that the version date in the primary key is called version begin date, not version create date
or version insert date. By the same token, we will call this second date object end date, not object
delete date. And, of course, we do not call it version end date. It is the end date of the object, i.e.,
the policy, not of a version of the object.

After we terminate Mary's policy, the Policy Version table looks like this:

Figure 5: Policy Version Table After Termination of Policy

We can identify the first row as the initial version for P138 because there is no other row for P138
with an earlier ver-beg-dt. We can identify the second row as the terminal version, i.e. the logical
delete point, for policy P138 because it is the only version for P138 whose obj-end-dt is not null.

The lifetime of policy P138 is from 1/12/04 to 6/11/06, inclusive. A query asking for what the policy
looked like, at any given point during its lifetime, can now be satisfied.

Notice that these two rows are a physically condensed way of expressing the same semantics that
would be expressed by having nearly 900 rows in the table, one for each day in the life of this
policy. These two rows are semantically equivalent to a set of nearly 900 consecutive daily
snapshots. This illustrates how, from a data storage perspective, versions are much more efficient
than snapshots.

Notice also that this is the first instance in which we have physically updated a row in the Policy
Version table. Prior to 6/12/06, the row with a 3/14/05 ver-beg-dt has a null obj-end-dt. From
6/12/06 forward, it does not. So we have overwritten that state of the row.

In doing so, have we lost information? In fact, we have not. We can infer from the second row
above that its obj-end-dt would have been shown as null prior to 6/12/06, and as not null from that

4 of 8 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 5: Ve... http://www.dmreview.com/dmdirect/20070615/1086123-1.html?type=pri...

date forward. So although we did physically overwrite that row, we lost no information about the
policy, or about what we knew about the policy at any point in its lifetime.

This shows why we think Dr. Snodgrass's metaphor of the relationship between nontemporal and
temporal tables of the same "things" is a poor one. He encourages us to think of a temporal table
as derived from a nontemporal one by adding various dates to it. So he would encourage us to
think of the Policy Version table as the Policy table with time management added to it. Indeed, in
his examples, he changes nothing in table names as he changes them from nontemporal to
temporal. (This metaphor is first encountered in {Snodgrass 2000, p.20}, where a Lot table is still a
Lot table after adding two pairs of dates. It is again encountered on p.113 (Adding History), p.117
"...adding valid-time support to a table..." and elsewhere.)

Throughout our discussion of Version Pattern 3, we have seen the profound difference between
objects and versions of objects. In temporal tables, we physically manage rows which represent
versions of objects. In the process, we logically/semantically manage those objects themselves.
But in that process, the table whose rows represent those objects disappears! We have not kept a
Policy table and added a Policy Version table. We have replaced the Policy table with a Policy
Version table; and that version table should not be thought of as the Policy table with time
management added to it.

The complete lifetime of policy P138 is shown on the timeline below.

Figure 6: Timeline Diagram for Scenario 3.3

It is only now, when the policy has been logically deleted, that we have full knowledge of its
timeline. It is only now that the last version on the timeline is not followed by an ellipsis.

The Semantics of Version Pattern 3

In Part 4, we defined four semantic constraints for Version Pattern 3. These constraints spell out
the implications of the single-episode, no-gaps assumptions that are in effect for this version
pattern. To repeat, those constraints are as follows:

{SC 3-1}. If a version for an object is followed by another version for that same object, the
(implied) end date of the former version must be one tick of the clock prior to the begin date
for the latter version.

{SC 3-2}. If a version for an object is not followed by another version for that same object
and is not itself a logical delete version, the end date will be assumed to be unspecified.
(This is not always the case, of course; and in later patterns, we will drop this simplifying
assumption.)

{SC 3-3}. If a version for an object is a delete version, the object will be considered deleted
as of that delete date.

5 of 8 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 5: Ve... http://www.dmreview.com/dmdirect/20070615/1086123-1.html?type=pri...

{SC 3-4}. It is semantically illegitimate for any version of an object to follow (to be
chronologically later than, based on its version date) a logical delete version of that object.

In this installment, we have defined two dates which are needed to implement Version Pattern 3 -
a version begin date and an object end date. Additional semantic constraints specify the rules for
populating these dates.

The first group of additional constraints specifies the rules for populating these two dates during
the insert of the first version for an object. Semantically, this is the action of inserting a new policy.

{SC 3-5}. Ver-beg-dt cannot be in the past or in the future, because it is also the system
insert date. To set such a date to anything but [now] would be a lie. We can't do anything in
the past, or in the future. [Now] is all we have.

{SC3-6}. Obj-end-dt must be null. We can't end something before it begins. We can't even
end something when it begins. We can only end something after it begins.

The second group of additional constraints specify the rules for populating these two dates during
the insert of a noninitial, nonlogical delete version for an object. Semantically, this is the action of
updating an existing policy.

{SC 3-7}. For every noninitial version of a policy, its ver-beg-dt must be at least one clock tick
past the ver-beg-dt of the immediately prior version for the policy. If it is not, then the two
versions will overlap for at least one clock tick. But that is invalid because there cannot be
two versions of the truth at any given point in time.

However, for Version Pattern 3, this condition should never arise. Since a ver-beg-dt is also the
system insert date, only an incorrect system clock could permit a version to have an earlier begin
date than the version which preceded it. It shouldn't be necessary to write code to enforce {SC
3-7} because if the system clock is incorrect, we have a much bigger problem on our hands and
will probably have to roll back and reapply updates anyway.

{SC 3-8}. Obj-end-dt must be null. If it is not, the policy is being deleted; and, by
assumption, this is not a delete transaction.

The third group of additional constraints specify the rules for populating object end date during a
logical delete. Semantically, this is the action of terminating a policy.

{SC 3-9}.To terminate a given policy, set the object end date of the current version (which
may or may not be the initial version) to [now]. The object end date cannot be in the past or
in the future, because it is also the system insert date.

What Comes Next

Version Pattern 3 is a significant achievement. It retains the knowledge of the state of an object at
any point during the lifetime of that object. Indeed, what more could there be? What more could
businesses want that versioned tables can provide?

The answer is: quite a lot. To begin with, we might want to see what would happen with Version
Pattern 3 if we distinguished business dates from system activity dates. This would mean that
version begin date and object end date are business dates only. For system activity dates, we
would then want to keep an insert date and a version update date. That latter date will be {null}

6 of 8 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 5: Ve... http://www.dmreview.com/dmdirect/20070615/1086123-1.html?type=pri...

except when the version is updated. And because with versions we are no longer considering
updates in place, this update can only be one thing - the date the object end date was specified.

In the academic literature, versioning which has both a business set of dates and a system activity
set of dates is said to implement a "bitemporal" pattern.

In providing an annotated list of version patterns, in Part 2, we did not distinguish a bitemporal and
nonbitemporal variation of Version Pattern 3. (In addition, the description of Version Pattern 3 was
incorrect, and should have stated the opposite of what it did state.) So following is a revised
annotated list of the version patterns which we are discussing in these articles.

This list provides a correct description for pattern 3. It "bumps down" the other patterns, and
specifies a pattern 4 which is pattern 3 with business and system activity dates distinguished.
Finally, it reverses the sequence of the original list's ninth and tenth patterns in order to put the
two patterns which describe snapshots next to one another.

7 of 8 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 5: Ve... http://www.dmreview.com/dmdirect/20070615/1086123-1.html?type=pri...

Figure 7: Annotated List of Version Patterns - Revision 1

For more information on related topics, visit the following channels:

Data Modeling
Databases
Real-Time Enterprise

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

8 of 8 10/7/2008 9:26 AM
Time and Time Again: Managing Time in Relational Databases, Part 6: Ve... http://www.dmreview.com/dmdirect/20070629/1087439-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 6: Version Pattern
4
Tom Johnston, Randall Weis
DM Direct, June 2007

Version Pattern 3, discussed in Part 5, keeps a record of the state of an object at every point
during its lifetime. This results in an unbroken chronological series of versions of the object and
guarantees that no versionable updates will be lost.

With Pattern 3 and Pattern 4, each version of an object is distinguished from all other versions of
the same object by means of a version begin date.

The start of an object's lifetime is marked by the chronologically earliest version of that
object.
The end of an object's lifetime is marked by a chronologically latest version which contains
an object end date.
The current version of an object whose lifetime is not yet ended is marked by a
chronologically latest version which contains {null} in its object end date.

Notice that these versions have no version end date. This is because Patterns 3 and 4 assume
that every non-initial version follows its predecessor immediately, i.e., that when a new version is
added, the previous version ends its business effectivity exactly one clock tick prior to the start of
business effectivity of the newly added version. In other words, with Patterns 3 and 4, inserting a
new version into a table determines both the begin date for that version, and also the (implicit) end
date for what had been the current version until that insert took place.

Integrating Data: An Open Source Approach


Any technology deployed in the information system needs to
interoperate with existing applications or databases. Learn through
real-life scenarios how Open Source data integration solves the
interoperability challenge. Free White Paper.

A second assumption that characterizes Patterns 3 and 4 is that after an object has been deleted,
it cannot reappear in the database. More technically, and using our Policy table example, given a
version with a non-null object end date, no version with a later version begin date can have the
same policy number. Equivalently, any version with a later begin date must be a version of a
different policy. In Part 4, we introduced the concept of an episode, and characterized this second
assumption as the assumption that no object can have more than one episode.

These two assumptions are semantic constraints on Patterns 3 and 4. If your business
requirements for versioned history rule out either or both of these assumptions, then neither of
these patterns will meet your needs.

Patterns 3 and 4

1 of 6 10/7/2008 9:27 AM
Time and Time Again: Managing Time in Relational Databases, Part 6: Ve... http://www.dmreview.com/dmdirect/20070629/1087439-1.html?type=pri...

A version pattern is most precisely defined by the combination of a) its business requirement
(noted as {BR-x}, where x is any number) which the pattern must satisfy, and b) its semantic
constraints (noted as {SC-x-y}, where x is the number for the pattern, and y is the number for the
constraint). In these terms, Pattern 4, which is the pattern we will examine in this article, satisfies
the same business requirement and operates under the same semantic constraints as Pattern 3 -
with one exception. That exception is semantic constraint {SC 3-5}:

{SC 3-5}. Ver-beg-dt cannot be in the past nor in the future because it is also the system
insert date. To set such a date to anything but [now] would be a lie. We can't do anything in
the past, nor in the future. [Now] is all we have.

With Pattern 4, this constraint is changed. Semantic constraints 1 - 4 and 6 - 7 are identical to
their corresponding Pattern 3 constraints. But Pattern 4 has no constraint corresponding to {SC
3-5} because it uses distinct dates to indicate when the version becomes effective (ver-beg-dt) and
when the version is inserted into its table (crt-dt).

Bi-Temporality

With Pattern 4, we introduce a second set of dates. The first set is business effectivity dates. For
Patterns 3 and 4, they are:

Version-begin-date (ver-beg-dt): the clock tick on which that version of the object becomes
the currently in effect version.
Object-end-date (obj-end-dt): the last clock tick on which the last version of the object is in
effect.

The second set, included in Pattern 4 but not Pattern 3, also consists of a pair of dates. They are:

Create-date (crt-dt): the clock tick on which that row is physically inserted into its table.
Last update date (last-updt-dt): the clock tick on which that row is physically updated.

Normally, versions aren't updated. And indeed, versions are never logically updated. The
information they contain is never lost. But as noted in Part 5, there is one case in which a version
may be physically updated. That happens when a transaction is received which directs the system
to logically delete the object. This logical delete of the object is implemented by physically
updating the current version, changing its object end date from {null} to the date of the logical
delete. When this physical update takes place, the last update date records the clock tick on
which it does. In all other cases, last update date is null.

These two sets of dates are orthogonal to one another. Changes to one set do not necessarily
entail changes to the other set.

Computer scientists have called the use of these two sets of dates the "bi-temporal" data pattern.
The difference between Patterns 3 and 4, then, is solely that Pattern 3 is not bi-temporal while
Pattern 4 is.

To work out the implications of this difference between the two patterns, let's begin by recreating
the series of events described in Scenario 1.

Scenario 4.1: a Proactive Insert

2 of 6 10/7/2008 9:27 AM
Time and Time Again: Managing Time in Relational Databases, Part 6: Ve... http://www.dmreview.com/dmdirect/20070629/1087439-1.html?type=pri...

With Pattern 4, it is now possible to physically insert a row on a different date than the date it
becomes effective. So let's assume that Mary's policy, entered into the database on 1/12/04, does
not become effective until 6/1/04. (Correction: In scenario templates, the fourth cell down on the
left-hand side specifies the date or dates on which the version could validly have been inserted. In
Part 5, Scenario 3.1, that cell contains "1/12/04 or any later date." It should contain "N/A," because
for that Version Pattern, there is no choice about when the insert takes place; it must take place
on the ver-beg-dt.)

The primary key of the policy version is policy-nbr (the unique identifier for a policy) plus version
begin date. All versions for the same policy have the same policy number, and are distinguished
from one another by their version begin date. In addition, for all versions but the current one (if
one is currently in effect), we also know the dates on which they end their effectivity. If there is a
terminal version, it is the clock tick in its object end date. Otherwise, it is one clock tick prior to the
begin date of the chronologically next version.

The insert of the row shown above is a proactive insert. It takes place some four and a half months
prior to the clock tick on which the inserted row becomes the in-effect row for that object. This
insert is shown on the timeline diagram below.

With proactive inserts, there is a period of time between when the row was physically inserted and
when the version it represents becomes effective. In the timeline diagram for Scenario 4.1, this
period of time is indicated by the shaded block at the front of the version.

One situation in which proactive inserts are valuable is when there are multiple versions that
become effective on the same clock tick. If our clock ticks once per day, then we have 24 hours to
insert all those versions. Usually, that will be enough time to handle all but the largest volume
situations.

But suppose our clock ticks once per second. With Pattern 3, we cannot insert versions rapidly
enough to give more than a small number of them the correct ver-beg-dt. This immediately
suggests that we should assign the correct ver-beg-dt to versions, rather than reading that date
from the system clock once for each version.

Indeed, the more granular our clock ticks are, the less reasonable it is to design a system in which

3 of 6 10/7/2008 9:27 AM
Time and Time Again: Managing Time in Relational Databases, Part 6: Ve... http://www.dmreview.com/dmdirect/20070629/1087439-1.html?type=pri...

each version's ver-beg-dt is populated from a read of the system clock. So suppose we redesign
our system to assign version begin dates rather than base them on a read of the system clock.
Since we also need to record the clock tick on which rows are physically inserted - for all the usual
reasons, such as backing out an entire batch run - we need to add an additional pair of dates,
which we will call create date and last update date. Doing this gets us from Pattern 3 to Pattern 4.

Retroactive Inserts?

By inserting this initial version, on 1/12/04, we have created policy P138, effective 6/1/2004. But
with increased expressive power - here the power to physically insert versions on dates other than
their effectivity dates - comes an increased possibility for making a mistake.

One such mistake is a retroactive insert. Suppose that instead of inserting Mary's policy on
1/12/04, we instead inserted it on 8/1/04.

Before we explain why this is an error, let's note that there is nothing the database management
system (DBMS) can do to prevent it. Put another way, SQL DDL does not let us tell the DBMS that
retroactive inserts are invalid. Until DBMSs are upgraded with robust temporal management
capabilities, we are stuck with a do-it-yourself approach. In this case, we must write code that
prevents an insert whenever the transaction's create date is later than its version begin date.
Because this is an integrity rule which applies to all versions in the database, it should not be left
to individual applications to enforce. It should be enforced by the database itself, by means of a
pre-insert trigger.

Why Not?

The reason that retroactive inserts are not valid is easy to see. They change the past. Prior to the
retroactive insert, the past looked one way; afterwards, it looks different.

Consider policy P138's version that has a 6/1/04 effective date but that we didn't insert until 8/1/04.
Prior to the retroactive insert, the past from 6/1/04 up to 8/1/04 looked like there was no policy
P138 in effect at any point during that period of time. Afterwards, it looks like there was a policy
P138 in effect during that period of time. But both things can't be true of the period 6/1/04 - 8/1/04;
either policy P138 was in effect, or it wasn't!

It's certainly true that during any period of time, including the 6/1/04 - 8/1/04 period, either P138
was in effect or it wasn't. But suppose that it was. In that case, for those two months, the database
was incorrect. Surely there must be a way to correct mistakes?

Well, with mistakes, we can either correct them or we can ignore them. If we correct them, we can
either eradicate all traces of the error or we can preserve the evidence. The choice depends on the
cost/benefit ratio of each alternative. Ignoring errors, actually doing nothing about them, is usually
not an option when we are dealing with production databases, so we won't consider that option
any further.

4 of 6 10/7/2008 9:27 AM
Time and Time Again: Managing Time in Relational Databases, Part 6: Ve... http://www.dmreview.com/dmdirect/20070629/1087439-1.html?type=pri...

When we are dealing with transactionally updated databases, accountants know that the way to
correct an error is to a) create a transaction which offsets the bad transaction, and then b) add the
correct transaction. Although our Policy table is not transactionally updated, it is versioned, so
something similar should take place. The error should not be erased or overwritten. It should be
"sequestered" in some way, so that it is invisible to "normal" queries but nonetheless retrievable
with specially written queries. Then the correct version should be inserted in its place.

This is the "preserve the evidence" approach to error correction. But it is not always the best
choice. Consider that correcting mistakes while at the same time not removing all traces of them
from queryable history (see Part 1 for an explanation of the term "queryable history") requires
complex logic for both updating the database and subsequently querying the database after the
correction. The cost of developing this logic is not trivial. Moreover, the potential for mistakes in
queries against a database containing such corrections - especially mistakes in end-user-written,
non-IT-managed queries - is very real and can be as costly as you care to imagine. If there are no
legal requirements to retain a queryable record of the mistake, for example if the only legal
requirement is to be able to pull archive tapes when auditors ask for them, then the benefit of
making both the correction and the error immediately queryable is not very great. In a case like
this, the best approach is to snapshot the database (so the error can later be preserved in archival
storage), and then proceed with the retroactive insert (or the update in place of an existing version,
if that is what is required).

However, suppose our queryable database is a database of record for the data in question
(policies, in our case). Suppose, further, that there is either a legal or a customer relations
downside to telling one story about whether or not policy P138 was in effect throughout 6/1/04-8
/1/04, when asked prior to 8/1/04, and subsequently telling the opposite story when asked.

A legal downside is easy to imagine. If a report on policies in effect was run originally on 7/31/04,
and sent to the appropriate regulatory agency, but a week later rerun and sent again to that
agency, the two reports are not going to agree about P138, or about any counts or other statistics
it might contribute to.

A customer relations downside is equally easy to imagine. If our customer service representative
(CSR) tells a customer one story about her policy on one call and a contradictory story on a later
call, the customer is not likely to be happy and won't feel increased confidence in the reliability of
our company.

Downsides that cannot be contained within the enterprise, such as regulatory, customer
relationship and even vendor relationship downsides, are usually taken far more seriously by
executive management than are downsides that affect only internal operations. With externally
visible downsides, the cost/benefit ratio quickly swings to the "preserve the evidence" approach to
error correction.

But neither Version Pattern 3 nor 4 supports correction with error retention. So if your business
requirements for versioned history do not require error retention in queryable history and do permit
you to manage nonrecurring continuously existing objects, then you may still be able to use one
or both of these patterns. Otherwise, neither Pattern 3 nor 4 will meet your needs. We will have to
wait for a later Version Pattern to get error correction with error retention.

In Part 7, we will finish our discussion of Version Pattern 4.

For more information on related topics, visit the following channels:

5 of 6 10/7/2008 9:27 AM
Time and Time Again: Managing Time in Relational Databases, Part 6: Ve... http://www.dmreview.com/dmdirect/20070629/1087439-1.html?type=pri...

Data Modeling
Databases
Real-Time Enterprise

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

6 of 6 10/7/2008 9:27 AM
Time and Time Again: Managing Time in Relational Databases, Part 7: ... http://www.dmreview.com/dmdirect/20070713/1088592-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 7: Completing
Version Pattern 4
Tom Johnston, Randall Weis
DM Direct, July 2007

As we saw in Part 6, what distinguishes Pattern 4 from Pattern 3 is bi-temporality. With Pattern 4,
the business effectivity begin date of a version can be different than the date the version was
physically inserted into its table. This was not possible with Pattern 3.

Consequently, bi-temporality makes it possible to physically insert a row representing a new policy
proactively, prior to the date the policy becomes effective. As long as queries are careful to filter
out rows whose effectivity lies in the future (unless those queries are expressly written to retrieve
such rows), the ability to proactively insert new objects is often very useful. With this ability, we
can "insert and forget," meaning that we can insert a new object as soon as we know when it will
become effective, and then not have to do anything later on to "activate" it. The mere passage of
time will ensure that the object becomes a "now in effect" object on the indicated effectivity date.

By the same token, bi-temporality also makes it possible to physically insert a row representing a
new policy retroactively, after the date the policy becomes effective. We concluded Part 6 by
discussing why retroactive inserts were semantically invalid, and why they must therefore be
prevented. A retroactive insert of a new object, n days after its business effective begin date, is
semantically invalid because it changes the past. Prior to the insert, a query about any point in
time within n days prior to the time of insert, would show that the object did not then exist. After
the insert, the same query would show that the object did exist at that point in time. One query,
contradictory results.

Join the Largest Community of MDM Experts in New York City,


October 19-21, 2008
Restructured to better suit your business needs, MDM Summit Fall 2008
offers value of experience and thought leadership. Speakers will discuss
hierarchy management, identity resolution and more. Pre-register by
August 8 for bonus savings!

Let's turn now to the two other scenarios we have been using to illustrate all of these version
patterns - updates and deletes.

Scenario 4.2a: A Type 1 Proactive Update

On 1/12/04, policy P138 is created for Mary Jaspers. As we have already described, the creation of
a new policy is implemented by inserting an initial version for that policy into the Policy Version
table.

However, policy P138 does not become immediately effective. It becomes effective on 6/1/04. It is,
thus, a proactive insert of a policy. Since inserting a new policy and updating an existing policy are
both implemented by physically inserting a row in the Policy Version table, proactive updating of a

1 of 5 10/7/2008 9:28 AM
Time and Time Again: Managing Time in Relational Databases, Part 7: ... http://www.dmreview.com/dmdirect/20070713/1088592-1.html?type=pri...

policy should follow the same pattern as proactive insertion of a policy. Proactive deletion should
be similarly straightforward. But as we shall see, things are a little more complicated than that.

While there is only one type of proactive insertion of a new policy, there are two types of proactive
updates. With the first type, Mary requests a change to her policy to become effective after the
policy itself becomes effective. With the second type, Mary requests a change to her policy to
become effective after it is entered into the database, but prior to its effective date.

To illustrate a Type 1 proactive update, let's assume that on 6/18/04, Mary requests a change in
policy type from PPO to HMO, to become effective on 9/1/04, and that this change is physically
recorded in the Policy Version table on 7/1/04. This is indeed a proactive update since the create
date is two months prior to the effectivity date.

After inserting the new version (a physical insert which semantically updates the policy), the Policy
Version table looks like this:

What these two rows tells us is semantically straightforward. Policy P138 goes into effect on
6/1/04, as a PPO policy. Three months to the day later, it remains in effect but becomes an HMO
policy.

Scenarios 4.2b: a Type 2 Proactive Update

But there is a second kind of proactive update to consider. What will the Policy Version table look
like if, instead of the update just illustrated, we instead enter an update on 3/10/04, which also
changes P138 from PPO to HMO, but this time becoming effective on 3/15/04?

Clearly this is also a proactive update, since it is entered prior to its effectivity. So shouldn't things
be just as straightforward as with the Type 1 update? Let's see.

After inserting the new version (a physical insert which semantically updates the policy), the Policy
Version table looks like this:

2 of 5 10/7/2008 9:28 AM
Time and Time Again: Managing Time in Relational Databases, Part 7: ... http://www.dmreview.com/dmdirect/20070713/1088592-1.html?type=pri...

What these two rows tell us may be semantically straightforward, but it is also wrong! In other
words, the result is not what we intended. These two rows tell us that Policy P138 will go into
effect on 3/15/04, as an HMO policy, and that two and a half months later, it will remain in effect
but will become a PPO policy.

Type 1 and Type 2 Proactive Updates. What's the Difference?

The difference between Type 1 and Type 2 proactive updates does not lie in the update
transactions themselves. We may assume that the schema for the transactions is the same in both
cases. Neither does the difference lie in the code processing the two transactions. We may
assume that it is the same code.

What, then, is the difference? Why do Type 1 proactive updates "make sense," while Type 2
proactive updates do not?

The relevant difference between these two types of proactive update seems to be that Type 1 is an
update to a policy that is already in effect, i.e., that has a version in effect at the time the new
version is physically inserted. Type 2, however, is an update to a policy that has not yet gone into
effect.

But how does this difference result in the difference between a semantically correct database state
and a semantically incorrect one?

One way to look at it is this. Version Pattern 4 does not permit us to enter versions with a known
effective end date. All versions entered are assumed to be valid from their effective begin date
"until further notice." So when we enter a version of P138 that becomes effective prior to a version
already in the table, we enter it with the understanding that it is effective until further notice, i.e.,
effective until we take some other action regarding policy P138.

The semantics of a Type 2 proactive insert, one whose effective begin date is prior to the effective
begin date of an existing version, violates this understanding. It violates the semantics of proactive
activity carried out in the absence of effective end dates because, without any further action being
taken with respect to policy P138, the version which was entered first will change its status from
being the insertion of a new policy to being an update of an existing policy.

3 of 5 10/7/2008 9:28 AM
Time and Time Again: Managing Time in Relational Databases, Part 7: ... http://www.dmreview.com/dmdirect/20070713/1088592-1.html?type=pri...

From Mary's point of view, she understood the semantics of her two requests to be this:

In the case of the Type 1 update, Mary's second request is to change an existing policy. She
requests that her policy, in effect since 6/1/04, be changed to an HMO policy three months
later. And this is what she gets.
In the case of the Type 2 update, Mary's second request is not to change an existing policy.
Instead, with this second request, Mary changes her mind about a policy which is not yet in
effect. Instead of a new PPO policy becoming effective on 6/1/04 and remaining in effect
until further notice, she now wants an HMO policy becoming effective on 3/15/04 and
remaining in effect until further notice. But this is not what she gets.

Again, the exact same code is used to make both changes. In the first case, Mary got what she
wanted. A database update correctly implements her change request. But in the second case, that
same code produces a disaster in the making. For two and a half months after the Type 2 update,
everything works as it should. Then, without any other database change taking place, on 6/1/04,
Mary's policy suddenly becomes an HMO policy.

In the first case, Mary wants a change to something already in effect. In the second case, she
wanted a replacement to a request that has not yet been implemented.

We can easily imagine that the business analyst who gathered the requirements for a versioned
history of policies failed to distinguish between changes and replacements. After all, there are only
three things you can do to rows in a table. You can insert them, update them or delete them.
What the user calls a change and what she calls a replacement (or perhaps calls "changing her
mind") are, to the business analyst, both physical changes to something already in the database.

In other words, to the business analyst, Mary's changing a policy and Mary's changing her mind
amount to a distinction without a difference. As Paul Newman said in Cool Hand Luke, "What
we've got here is a failure to communicate."

Proactive Deletion

We need to discuss proactive deletions only briefly. A Type 1 proactive deletion terminates an
existing policy. A Type 2 proactive deletion terminates a policy that does not yet exist. But because
you can't do anything to something that doesn't exist, we need to say what we mean more clearly
than this.

In a non-versioned table, one row represents one object. In a Policy table, each row would
represent a policy. But in the Policy Version table, no row represents a policy and each row
represents one time slice of one policy. Consequently, actions to the policies themselves do not
line up, one for one, with actions to rows in the Policy Version table.

With non-versioned tables, whose rows are one for one with objects, semantics are generally one
for one with physical activity. For example, physically inserting a row is semantically inserting an
object; physically updating a row is semantically updating an object; and so on. But with versioned
tables, none of whose rows represent objects, semantics are not similarly isomorphic with physical
activity. For example, physically inserting a row may be semantically inserting an object; but it may
not be. Instead, it may be semantically updating an object. And the only physical updates to
versioned tables, at least with the version patterns considered so far, are done to implement
semantic deletions.

So to speak more clearly about deletions, we should say this: a Type 1 proactive deletion

4 of 5 10/7/2008 9:28 AM
Time and Time Again: Managing Time in Relational Databases, Part 7: ... http://www.dmreview.com/dmdirect/20070713/1088592-1.html?type=pri...

terminates an existing policy by supplying a termination date for the version of the policy that was
current at the time of the physical Type 1 action. A Type 2 proactive deletion terminates the
version of a policy which has the latest effective begin date, but does so prior to that date. Thus,
unless some other action takes place, a Type 2 proactive deletion ensures that its corresponding
policy will never become effective.

What Comes Next

We will not itemize the semantic constraints which define Version Pattern 4. Enough has already
been said that the reader should be able to modify the semantic constraints for Pattern 3 and
create a set of constraints that correctly define Pattern 4.

Next time, we will begin our discussion of Version Pattern 5. It is simpler than Pattern 4 in that it is
not a bi-temporal pattern. But it is more complex than Pattern 4 in that it drops the non-recurring
assumption. Once we have learned to handle objects that appear for awhile, vanish for awhile,
and then show up again, and to do so without bi-temporality, we will then re-introduce
bi-temporality. That will take us to Version Pattern 6 which is the penultimate versioning pattern.

For more information on related topics, visit the following channels:

Data Modeling
Databases
Real-Time Enterprise

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

5 of 5 10/7/2008 9:28 AM
Time and Time Again: Managing Time in Relational Databases, Part 8: Ve... http://www.dmreview.com/dmdirect/20070803/1089849-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 8: Version Pattern
5
Tom Johnston, Randall Weis
DM Direct, August 2007

Version Pattern 5 is the first pattern to manage objects with multiple episodes, a concept
introduced in Part 4. To avoid taking on too much at once, however, Pattern 5 will not include
bi-temporality. That will come with Pattern 6 which will be, consequently, a pattern supporting
both multiple episode objects and bi-temporality.

Up to this point, the version patterns we have examined have incorporated the assumption that
once an object is logically deleted, it will never recur. In Part 4, we called this the "single episode"
assumption. Technically, this assumption means that a) there can be at most one logical delete
version for an object; and b) that no other version for the same object can have an effectivity begin
date later than the logical delete date.

Single Episode Objects

In the case of single episode versioning, it is reasonable to ask, "What kind of objects never
recur?" For if there aren't any such objects, then the patterns considered thus far have no real
world applicability.

Intuitive Data Reporting & Dashboards


Get instant visibility into your business with clickable access to your
data no matter where it resides. Customize reports & dashboards on
the fly.Try PivotLink now.

The answer to this question depends on what we mean by "recur". So, at this point, we could
embark on a general discussion of the concept of recurrence. That discussion would include an
examination of the ontology of types vs. instances, and of enduring (persisting through time)
objects vs. events (processes in and through which objects change over time).

The types vs. instances distinction is, in relational databases, the distinction between a table and
its rows, or an entity and its attributes. In object-oriented theory, it is the distinction between object
classes and their objects, and there the ontological question looms large, as the question of
whether or not classes are "first-class objects". These questions are the current manifestation of
what the medieval Scholastics (especially Ockham and Duns Scotus) knew as the realism /
nominalism debate.

For realists, types really existed, just as truly as their instances did. Realists would have treated
classes as themselves objects. For nominalists, instances were what really existed; a type was just
a name given to the common features we observed across many different objects. Nominalists
would not have treated classes as objects, able to exert influence on other objects, for example.

1 of 5 10/7/2008 9:29 AM
Time and Time Again: Managing Time in Relational Databases, Part 8: Ve... http://www.dmreview.com/dmdirect/20070803/1089849-1.html?type=pri...

What about the second philosophical distinction, that between objects and events? Some
proponents and followers of the Kimball methodology believe that everything of importance can be
represented in dimensional data structures. From the perspective of the history of Philosophy, we
would see that this position reflects a commitment to the ontological priority of events over objects.
We would see that the approach to change which we have taken in this series of articles, which
involves preserving both the before and after states of objects undergoing change, reflects a
commitment to the ontological priority, or at least equivalence, of objects to events.

This, we believe, would be an important discussion. The history of Philosophy has much to tell us
about problems which IT professionals and computer scientists alike are still struggling with. A
discussion like this would illustrate how such business IT-specific issues as we have just
mentioned are but the current manifestation of two ways of thinking about the world, whose roots
go back at least to the pre-Socratic philosophers. Although object-orientation (in the sense of the
ontological priority of "things" and "stuff") has dominated Western metaphysics for two-and-a-half
millennia, there has been a strain of process-orientation (in the sense of the ontological priority of
events and occasions). This latter strain originates, perhaps, in the reflections of Heraclitus, and
his famous dictum "panta rei" - "all things change". It reached perhaps its fullest expression some
eighty years ago, in the work Process and Reality, by Alfred North Whitehead.

Don't Know and Don't Care

The problem with the single-episode assumption is that things exist over time, and so always
contain within themselves the possibility of recurrence - of appearing to us, going away, and then
returning. So, by their very natures, objects are fraught with multi-episode possibilities.

The reason that version patterns incorporating the single-episode assumption may still have some
real-world use, is that there are objects we may be interested in whose recurrence is not of interest
to us. In other words, there may be objects which, when they do recur, we treat as new objects.
Either we don't notice that they are the same objects we encountered before, or we do notice but
don't care.

A simple example is a policy holder. Suppose that Joe Smith is a policy holder with our company
from 2002 to 2004, at the end of which year he lets his policy lapse. A year later, Joe takes out
another policy. If our company makes no effort to identify Joe as the same person who held a
policy from 2002 to 2004, then when he returns, in 2006, Joe is effectively a new policy holder for
us - not just a holder of a new policy, but a new holder of a policy.

Twenty years ago, to take another example, it was common for banks to manage "accounts".
Some effort was made to identify cases in which one person held several accounts concurrently,
but almost no effort was made to identify cases in which one person held several accounts
successively.

So the answer to the question "What kind of objects are single-episode objects?" is clear. Single-
episode objects are those whose recurrence we either don't know about, or don't care about.

But with Version Pattern 5, we begin to deal with multi-episode objects. These, then, are objects
whose recurrence we do make an effort to identify and track.

Across industries that deal with people as customers, the same change has taken or is currently
taking place. Insurance companies who managed a relationship with policy holders now want to
manage a relationship with persons who are, or who may become, policy holders. Banks which
used to manage the account relationship now want to manage the customer relationship. In some

2 of 5 10/7/2008 9:29 AM
Time and Time Again: Managing Time in Relational Databases, Part 8: Ve... http://www.dmreview.com/dmdirect/20070803/1089849-1.html?type=pri...

industries, companies are even striving to recognize the appearance and re-appearance of
households.

Multi-Episode Objects

If an object could recur, what would that look like, as far as data is concerned? In the case of
Mary's policy, it would look like a logical delete of policy P138 followed, after one or more clock
ticks, by another version for P138 which is not a logical delete.

Normally, persistent objects re-appear after some period of time, i.e. after some possibly large
number of clock ticks. So there can be gaps between episodes of the same object, although if
there is only one clock tick between a logical delete and the next version of an object, those
episodes would not have a gap between them.

This is illustrated in Figure 1.

Figure 1: Episodes of an Object

If the number of logical deletes of an object is the same as the number of episodes, then, for as
long as that is true, the object does not currently exist on our database. That is because every
episode, including the most recent one, has been terminated by a logical delete.

The only other case that can arise is that the number of episodes is one greater than the number
of logical deletes. That is because, in a series of versions for the same object, what splits them
into episodes are the logical delete versions. In this case, the object does currently exist on our
database because there is no logical delete version for the most recent (or only) episode.

Semantic Constraints for Pattern 5

Pattern 5 is identical to Pattern 3, but with the single-episode constraint dropped. So instead of
presenting our usual insert/update/delete scenarios, we will develop Pattern 5 by beginning with
the semantic constraints (called "assumptions" when that is the more convenient terminology) for
Pattern 3, and seeing what changes need to be made to them to accommodate the multi-episode
functionality.

We start with the first constraint.

Semantic Constraint {SC 3-1}.

{SC 3-1} states that "if a version for an object is followed by another version for that same object,
the end date of the former version must be one tick of the clock prior to the begin date for the
latter version." This constraint says that versions must have no gaps in time between them. But

3 of 5 10/7/2008 9:29 AM
Time and Time Again: Managing Time in Relational Databases, Part 8: Ve... http://www.dmreview.com/dmdirect/20070803/1089849-1.html?type=pri...

multiple episodes of an object nearly always will have gaps in time between them. For example, a
policy expires, because of a late payment. But a week later that policy is re-instated because the
company granted a grace period in this case, in order to retain a good customer.

So to accommodate multiple episodes, {SC 3-1} must be modified to apply only to versions which
do not follow a terminal (logical delete) version. This gives us the corresponding constraint for
Pattern 5.

{SC 5-1}. If a non-terminal version for an object is followed by another version for that same
object, the end date of the former version must be one tick of the clock prior to the begin date for
the latter version.

Semantic Constraints {SC 3-2 and 3-3}.

{SC 3-2} states that "if a version for an object is not followed by another version for that same
object, and is not itself a logical delete version, the end date will be assumed to be unspecified".
And {SC 3-3} states that "if a version for an object is a delete version, the object will be considered
deleted as of that delete date". Both of these semantic constraints apply to Pattern 5 as well.

Semantic Constraint {SC 3-4}.

{SC 3-4} states that "it is semantically illegitimate for any version of an object to follow (to be
chronologically later than, based on its version date) a logical delete version of that object."

This is the single-episode constraint. So it is precisely the constraint that is dropped for Pattern 5.

Semantic Constraints {SC 3-5 and 3-6}.

The next two semantic constraints for Pattern 3 explicitly apply to a logical insert, i.e., to the first
version of an object.

{SC 3-5} states the constraint on the effectivity begin date. It says, of the first version of an object,
that "ver-beg-dt cannot be in the past or in the future, because it is also the system insert date. To
set such a date to anything but [now] would be a lie. We can't do anything in the past, or in the
future. [Now] is all we have."

{SC3-6} states the constraint on the logical delete date. It says, of the first version of an object,
that "obj-end-dt must be null. We can't end something before it begins. We can't even end
something when it begins. We can only end something after it begins."

What's different about Pattern 5 is that there can be multiple first versions of an object, one for
each episode. So the question is whether these two constraints apply to initial versions of any
episode, or just to the initial version of the first episode.

The answer is that these two constraints apply to initial versions of any episode of an object. No
matter what episode we are dealing with, it is obviously true that (without bi-temporality) we can't
do anything in the past, or in the future. And it is equally obvious that (without bi-temporality) we
can't end something before it begins, or even when it begins.

Taking Stock of Where We Are

We will complete Version Pattern 5 in Part 9. But here, at a point about a third of the way through

4 of 5 10/7/2008 9:29 AM
Time and Time Again: Managing Time in Relational Databases, Part 8: Ve... http://www.dmreview.com/dmdirect/20070803/1089849-1.html?type=pri...

this series of articles, we want to remind you why we are considering several versioned history
patterns, and also why the discussion has been so painstakingly detailed.

There are two reasons for considering several version patterns. The first is that each pattern can
be used to satisfy any set of business requirements which conform to the semantic constraints
stated for the pattern. Earlier patterns are less complex than later ones, and some developers may
prefer to use simpler versioned history patterns when their system requirements permit it.

The second reason for considering several version patterns is to gradually build up to what is the
most powerful versioned history pattern that we know of. As we said in Part 1, the versioned
history pattern we have evolved is the most complex data pattern for its size that we know of. It is
best, pedagogically, to approach it one step at a time.

Another pedagogical feature of these articles is their discursive style. Often, data patterns or data
models are presented as completed objects. Little effort is spent on explaining the thought
processes that led up to them.

We believe, however, that the thought processes are, if anything, even more important than the
completed pattern. If we can teach senior data modelers how to think about time, by explaining
how we thought through the issues and why we chose the solutions we did, then the final
versioning pattern we present should seem almost inevitable. If, at the end of this series of
articles, that pattern does seem inevitable or at least completely plausible to you, then we will
have given you not just a tool for managing time in databases, but also the skills to build other
such tools.

For more information on related topics, visit the following channels:

Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

5 of 5 10/7/2008 9:29 AM
Time and Time Again: Managing Time in Relational Databases, Part 9: ... http://www.dmreview.com/dmdirect/20070817/1090937-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 9: Completing
Version Pattern 5
Tom Johnston, Randall Weis
DM Direct, August 2007

We noted in Part 8 that Version Pattern 5 is equivalent to Version Pattern 3, but with the single
episode constraint {SC 3-4} removed. However, in that same installment, we also pointed out a
difference in the first semantic constraint for each pattern. Constraint {SC 3-1} states that "if a
version for an object is followed by another version for that same object, the end date of the former
version must be one tick of the clock prior to the begin date for the latter version." But for multi-
episode objects, clearly, there may be any number of clock ticks between the logical delete
version for one episode, and the initial version of the next episode. We accommodated this change
by specifying that constraint {SC 5-1} has the same wording as {SC 3-1}, except for replacing "if a
version for an object ..." with "if a non-terminal version for an object ..."

On examining constraints {SC 3-5} and {SC 3-6}, we found no differences. These constraints
apply, without modification, to Pattern 5. This brings us to the last two constraints for Pattern 3. Do
they also apply to Pattern 5? Except for the minor rewording of {SC 3-1} and dropping {SC 3-4},
are the semantic constraints for Pattern 5 exactly the same as those for Pattern 3?

Chart of Version Patterns

Before continuing, we insert here the current chart of version patterns. References back to earlier
patterns are happening pretty frequently now, and we hope this will help the reader recall what
those prior version patterns were like.

Integrating Data: An Open Source Approach


Any technology deployed in the information system needs to
interoperate with existing applications or databases. Learn through
real-life scenarios how Open Source data integration solves the
interoperability challenge. Free White Paper.

1 of 6 10/7/2008 9:29 AM
Time and Time Again: Managing Time in Relational Databases, Part 9: ... http://www.dmreview.com/dmdirect/20070817/1090937-1.html?type=pri...

Figure 1: Chart of Version Patterns

The Importance of Semantic Constraints

Before we proceed, let's remind ourselves of the importance of explicitly listing the semantic
constraints for each pattern. There are two reasons for doing so. The first is that, together with the
schemas for each pattern - the columns used to implement the pattern - these constraints are the
formal and fully explicit definition of each pattern.

The scenarios show how each pattern works. The diagrams provide useful visual representations
of what is going on. But the constraints and the schemas are the patterns.

The second reason for explicitly listing semantic constraints is that we will have fully and correctly
implemented a version pattern if and only if a) we include the versioning columns defined for the
pattern in all tables we wish to version; b) we enforce each of these pattern-specific constraints;
and c) for all version patterns, we implement what I will call "temporal referential integrity."

Entity integrity constraints - the uniqueness of primary keys - are still enforced by the DBMS. But,

2 of 6 10/7/2008 9:29 AM
Time and Time Again: Managing Time in Relational Databases, Part 9: ... http://www.dmreview.com/dmdirect/20070817/1090937-1.html?type=pri...

as we shall see later on, we cannot use the DBMS to enforce referential integrity. When standards
committees stop squabbling and produce a temporal data management standard, and then when
vendors implement that standard, it is to be hoped that "temporal referential integrity" will be
enforced by the DBMS. But because we are still in the "do it yourself" period, as far as the
evolution of temporal database management is concerned, that's what we will have to do. We will
have to implement temporal referential integrity ourselves, and also the specific constraints that
define each version pattern. That means, in most cases, that we will have to write code. In general,
that code will be written as before-insert triggers.

But this is getting ahead of ourselves. For now, let's wrap up Version Pattern 5 by looking at its
last two semantic constraints.

Semantic Constraints {SC 3-7 and 3-8}

Constraints 5 and 6 apply to initial versions of objects. For Pattern 3, these are the versions that
implement the insert of a new object. For Pattern 5, these are the versions that implement the
insert of a new episode of an object. Inserting a new object is thus a special case of inserting a
new episode of an object. It is the special case of inserting an initial episode of an object.

Constraint 5 applies to the effectivity begin date, and constraint 6 to the logical delete date.
Constraints 7 and 8 apply to the same pair of dates, but implement versioned updates rather than
versioned inserts.

{SC 3-7} states the constraint on the effectivity begin date. It says of any noninitial, non-logical-
delete version of an object, that "its ver-beg-dt must be at least one clock tick past the ver-beg-dt
of the immediately prior version for the policy. If it is not, then the two versions will overlap for at
least one clock tick. But that is invalid because there cannot be two versions of the truth at any
given point in time."

This constraint also applies to Pattern 5, and for exactly the same reason. However, in reviewing
Pattern 3, we have discovered that constraint 7 is a corollary of constraint 1. Constraint 1 states
that "If a (non-terminal) version for an object is followed by another version for that same object,
the end date of the former version must be one tick of the clock prior to the begin date for the
latter version." This is the same thing as saying that "... the begin date of the latter version must
be one tick of the clock later than the end date for the former version." If it must be one tick of the
clock later, then it follows that it "must be at least one tick of the clock" later.

So constraint 7, first of all, should have said "exactly" instead of "at least." But in either form, it is
not a distinct constraint because it follows from constraint 1. If code implemented constraint 1,
there would be no need for additional code to implement constraint 7.

{SC 3-8} states the constraint on the logical delete date. It says, of any non-initial, non-logical-
delete version of an object, that "obj-end-dt must be null. If it is not, the policy is being deleted;
and, by assumption, this is not a delete transaction."

Version Pattern 5 has the same semantic constraint. It says that the object end date for a version
has a value if and only if its corresponding object is being logically deleted. Note that obj-end-dt is
a business date, not a date reflecting activity against the DBMS. (In general, when we characterize
a column as a "business" or "logical" or "semantic" column, we mean the same thing. More
precisely (since these terms emphasize different things, and thus are not truly synonyms), every
column described in these articles is either all three of these things, or none of them.)

3 of 6 10/7/2008 9:29 AM
Time and Time Again: Managing Time in Relational Databases, Part 9: ... http://www.dmreview.com/dmdirect/20070817/1090937-1.html?type=pri...

Infrastructure is a Hard Sell

More than one company has turned down the opportunity to implement versioned history as an
enterprise resource, as a data management layer on top of all production databases, and on
which a view layer can be built to hide its complexities. The attitude within IT seems to be that they
need to get on with the job of building or re-building new OLTP systems, warehouses, data marts
or ODSs (Operational Data Stores). If versioned history is required, they will just add a date to the
primary key and be done with it. It works like this.

Versioned history is usually introduced during a project to build some kind of data warehouse,
some kind of database for which there is a requirement to keep some amount of versioned history
(as defined in Part 1). The next time a database is being created or modified and versioned history
again becomes a requirement, it is implemented for that database. If a data mart is being built or
modified, one based on the dimensional model, and there is a requirement to provide not only
dimensions as they are currently, but also dimensions as they were at some point in the past, then
work is done and, eventually, the result is achieved.

This approach works, more or less. As time becomes increasingly important in databases, it will be
used more and more frequently. But it is the wrong approach, and results in a set of databases
with (a) hampered semantic interoperability; (b) little or no flexibility to accommodate additional
semantic temporal requirements; (c) high maintenance and enhancement costs; (d) a higher
proportion of serious (but often difficult to detect) semantic errors in the data; and (e) a database
whose time management functions must be understood and manipulated by both end user and
IT-developer query writers.

Let's consider each of these flaws in the ad hoc approach, one at a time.

1. Hampered semantic interoperability. An ad hoc, one-database-at-a-time approach might be


represented by a set of databases, each of which uses one of these version patterns.
Obviously, federated queries that range across two or more such databases will have to
reconcile the different implementations of versioned history. As we have seen, just by
considering these first five version patterns, that reconciliation is likely to be difficult and
error-prone.

2. Restricted flexibility. Suppose one of these version patterns has been implemented in a
database, and that later on, the business comes back and asks for additional versioned
history functionality. This is equivalent to asking for a new Version Pattern. If this involves
nothing more than rewriting insert triggers, the changes will be relatively easy to implement.
If it involves adding non-primary key columns to a versioned table, the changes will still be
relatively easy to implement.

But the change from Pattern 2 to Pattern 3, and another change that we will consider later on,
both involve changes to the primary key of the versioned tables. These changes are still easy to
implement, compared to changes in primary keys for non-versioned tables. The reason, as we will
see, is that referential integrity for versioned tables is not implemented by storing primary key
values in foreign key columns. This means that changes to the primary keys of versioned tables
affect those tables only and do not "ripple out" to affect all foreign key-related tables.

Nonetheless, the costs of incrementally adding temporal functionality are not negligible. Whenever
the schemas are changed, for example, all code that touches those schemas must be altered.
Although, as described in Part 1, only view-creating and physical update code will directly access
versioned tables, this can still be a significant burden when changes must be made. Far better to

4 of 6 10/7/2008 9:29 AM
Time and Time Again: Managing Time in Relational Databases, Part 9: ... http://www.dmreview.com/dmdirect/20070817/1090937-1.html?type=pri...

implement an "ultimate" versioned history solution, and then "turn on" specific functionality as and
when it is requested.

3. High maintenance and enhancement costs. High maintenance costs result from the need to
maintain different version-implementing codebases. High enhancement costs are just a
corollary of the previous point, restricted flexibility.

The main source of high enhancement costs is the cost of adding a date column to the primary
key of each table being converted to a version table, together with the (erroneous) belief that
foreign keys to the table being converted must also be changed. With primary key expansion, and
the belief that foreign keys must be maintained when a table is converted to a versioned tables,
foreign keys in all directly related objects must also be expanded. Thus, the cost of change is no
longer simply the cost of adding a date to a primary key, in a single table. Furthermore, when
foreign keys to the original table are themselves part of the primary keys in dependent tables, the
need to modify foreign keys ripples out to the tables dependent on those tables. This ripple effect
can continue indefinitely, and require changes to any number of tables. Another problem is that
complex "trickle" processes must be developed to replicate changes in versions and to replicate
terminating events to dependent objects.

4. Risk of errors. This is a special case of the previous point, high maintenance and
enhancement costs.

5. Risk of Misinterpretation of Query Results. In addition, there is the specific risk of errors in
interpretation. Because of the complexity of versioned history, it is quite easy to write queries
that return apparently valid results, but that are actually answers to a slightly different
question than the one the query author had in mind. This risk of misinterpreting query
results exists for all queries, of course. But we have found that it is an especially high risk
when temporal queries are involved.

The Biggest Risk: Internal Machinery Exposed to Query Writers with the Ad Hoc Approach

Temporal management functions are infrastructure. It's just that standards committees can't agree
on how to implement them, and so DBMS vendors hold back. This does not mean that temporal
management functions should be developed on the fly, one database at a time. It means that they
are infrastructure we must currently develop ourselves.

Although not necessarily so, every ad hoc implementation of versioned history that we have seen
exposes the temporal management infrastructure to both developers and business users of the
databases. It also asks modelers and developers to design and build perhaps the most complex
semantic patterns that databases are asked to manage. And the ability to do this well, let alone
the ability to do it correctly, is found only with the most advanced modelers and developers an IT
department has. For those IT managers who think that modeling and database development are
commodity skills, the management of temporal data is the strongest disproof possible.

Because temporal data management is infrastructure, it should be encapsulated within a data


access layer that hides its complexity from modelers, developers and business users alike, as
explained in Part 1. Just as designers and developers don't have to implement balanced-tree
access methods, but make use of them every time they design and build a relational database, so
too should they be able to use temporal data management without the internal machinery being
exposed to them.

Infrastructure projects are always a hard sell. By definition, they don't satisfy any particular need

5 of 6 10/7/2008 9:29 AM
Time and Time Again: Managing Time in Relational Databases, Part 9: ... http://www.dmreview.com/dmdirect/20070817/1090937-1.html?type=pri...

of a specific business unit. They are paradigmatically enterprise work, work whose value is
realized indirectly by the specific applications and databases that serve specific business users
and purposes. The management of time with today's relational DBMS products is a "do it yourself"
effort. For the reasons just articulated, it is work whose considerable cost should be born by the
enterprise.

One final note. We have noticed that when infrastructure work is proposed, and objections are
raised to it, at some point someone will say that what we are proposing is a "build it and they will
come" approach. And, of course, everyone knows that that is a bad idea. But with infrastructure
work, the only alternative to "Build it and they will come" is "get in, patch it, and get out as quickly
and cheaply as possible."

We believe that when an enterprise solution to temporal database management is built, they will
come. It is not the business that will come. But the business will make demands on developers
that require temporal data management, and those developers will come.

Earlier, in the numbered paragraph entitled "High Maintenance and Enhancement Costs," we
suggested that foreign keys don't work with versioned tables the same way they work with
"normal" unversioned ones. It is important to be very clear about this. A common
misunderstanding of versioning is that when a new version of an object is created, all foreign keys
that pointed to the original version must be cascade-updated to point to the new one. Indeed,
were it true that foreign keys work the same way in versioned and un-versioned tables, the
problems would get even more complicated.

So next time, in Part 10, we will stop and examine the use of foreign keys when versioned tables
are involved. We will see that DBMS-enforced referential integrity cannot be used. A new way of
using foreign keys and enforcing referential integrity is required. And once again, because
temporal database management is on hold as far as DBMS vendors are concerned, we will have
to take a "do it yourself" approach.

For more information on related topics, visit the following channels:

Data Management
Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

6 of 6 10/7/2008 9:29 AM
Time and Time Again: Managing Time in Relational Databases, Part 10 - ... http://www.dmreview.com/dmdirect/20070907/1092319-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 10 - Foreign Keys
in Version Tables
Tom Johnston, Randall Weis
DM Direct, September 2007

In trying to explain versioning to other data modelers and DBAs, we have found that many of them
are worried about the cascade update implications for versioned tables. In general, the concern is
that if a new version of an object is inserted into a version table, then all foreign key references to
the version it replaced must be updated to point to the new version. Furthermore, if those updates
take place in tables which are themselves versioned tables, then those updates create additional
new versions, and all foreign keys that point to them must also be updated. And so a single
versioned update can cascade out to affect many other versioned tables, ending only when a
cascade chain reaches either a) a non-versioned table or b) a versioned table which does not itself
contain any foreign keys.

These concerned modelers and DBAs correctly point out that this cascade effect can make
versioned updates very expensive. Furthermore, if there are any circular references, the cascade
will become an infinite loop. For example, consider this pair of tables: a Client Version table and
our Policy Version table. The Policy Version table contains a foreign key to the Client Version
table, and each row in the Client Version table contains a count of the total number of versioned
updates across all policies for that client. The infinite loop happens like this:

1. A new version of a policy is inserted. Call the old version P12 and the new version P13.
2. This triggers code that updates the policy version count in the related client, resulting in a
new version of that client. Call the old version C25 and the new version C26.
3. At this point, we must cascade update the policy version. P13's foreign key still points to
client version C25, and it must be updated to point to C26. This creates a new version of that
policy, P14.
4. But now the trigger executes once again, updating the policy version count in the related
client. This loops us back to step 2, where we update that count, creating client version C27.
5. But once again, we must cascade update the policy version. P14's client foreign key still
points to C26. This update creates a new version of that policy, whose client foreign key
points to C27. This new version of the policy is P15.
6. And so on, and so on, to either the last syllable of recorded time, or to a frustrated machine
operator who cancels the process.

Also, notice that so far, we have talked only about cascade updates. There are also, of course,
cascade deletes. As we will see later on, the semantic constraints surrounding cascade deletes in
versioned tables are not identical to those for cascade deletes of non-versioned tables.

1 of 7 10/7/2008 1:02 PM
Time and Time Again: Managing Time in Relational Databases, Part 10 - ... http://www.dmreview.com/dmdirect/20070907/1092319-1.html?type=pri...

Integrating Data: An Open Source Approach


Any technology deployed in the information system needs to
interoperate with existing applications or databases. Learn through
real-life scenarios how Open Source data integration solves the
interoperability challenge. Free White Paper.

But for now, we will concentrate on the cascade update problem.

Chart of Version Patterns

Before continuing, we insert here the current chart of version patterns. References back to earlier
patterns are happening pretty frequently now, and we hope this will help the reader recall what
those prior version patterns are.

2 of 7 10/7/2008 1:02 PM
Time and Time Again: Managing Time in Relational Databases, Part 10 - ... http://www.dmreview.com/dmdirect/20070907/1092319-1.html?type=pri...

Figure 1: Chart of Version Patterns

Chart of This Series

Before continuing, we insert here a chart listing the installments to date in this series and a brief
description of what each one is about. As this series grows longer, the back references to earlier
installments will only increase, and we think that having a chart like this at hand will be helpful.

3 of 7 10/7/2008 1:02 PM
Time and Time Again: Managing Time in Relational Databases, Part 10 - ... http://www.dmreview.com/dmdirect/20070907/1092319-1.html?type=pri...

Figure 2: Chart of Installments to Date in this Series

What to Do About the Cascade Update Problem for Versioned Tables

There are only three possible responses to the versioned table cascade update problem (or to any
problem). They are: a) ignore the problem, b) avoid it or c) fix it.

Ignore it. The first response is to ignore the problem. If we take this approach, it means that in
many if not most cases, a single versioned update will trigger any number of additional versioned
updates. Perhaps the attitude about multiple updates is: "DASD is cheap; CPU cycles are cheap.
So just let it happen." But then what about the possibility of an infinite loop? Perhaps the attitude
about infinite loops is: "Design your databases so there aren't any loops."

In some specific cases, this may be an acceptable response. But if our objective is to develop an
enterprise solution (which, indeed, is our objective), then this response just won't do. We need a
response that works in all possible cases, not just in special individual cases. We can't just ignore
the issue of cascade updates in versioned tables.

Avoid it. The second response is to avoid the problem. This means to turn off referential integrity
checking in the DBMS (which would stop the loop in the client to policy direction). If we can accept
cascade updating, and are only concerned to avoid the infinite loop, an alternative is to turn off the

4 of 7 10/7/2008 1:02 PM
Time and Time Again: Managing Time in Relational Databases, Part 10 - ... http://www.dmreview.com/dmdirect/20070907/1092319-1.html?type=pri...

developer-written code which updates counts (which would stop the loop in the policy to client
direction).

Avoiding the problem is not as cavalier as it sounds. Standards committees are at loggerheads
over how to introduce temporal management functions into relational DBMSs, and as a result,
DBMS vendors are stymied. Consequently, there is no adequate support for temporal database
management in current relational DBMSs. So we shouldn't be surprised that there is also no
DBMS-implemented referential integrity that works for temporal (versioned) tables. Until there is
such support, avoiding the problem by turning off referential integrity may be a very reasonable
response.

But isn't the proper conclusion, given DBMS shortcomings, to implement referential integrity with
our own code? Isn't the proper conclusion to fix it, not to avoid it?

That depends, of course, on a comparison of the costs and benefits of the two responses.
Specifically:

What costs do we incur if we fix the problem? We may assume that these are all IT costs,
not business costs. The answer is that we incur the development and maintenance costs
that are part of any IT software artifact.
What benefits do we garner if we fix the problem? We may assume that these are all
business benefits, not IT benefits. The answer is that a) we can use DBMS-supplied
referential integrity, and thereby gain the assurance that RI errors will not occur; and b) by
updating foreign keys whenever referenced rows are replaced by newer versions, the
relationships involving versioned tables will remain current.
What costs do we incur if we avoid the problem? We may assume that these are all business
costs, not IT costs. The answer is that a) since we cannot use DBMS-supplied referential
integrity, we must therefore write our own code to guarantee that integrity; and b) since
relationships among versioned tables are not updated if we avoid the problem, such
relationships gradually become more and more "out of date" as additional versions of a
foreign-key referenced object are created.
We may assume that there are no benefits to avoiding the problem (other than cost
avoidance).

Fix it. The third response is to fix the problem. To fix any problem, we need to start with a correct
diagnosis. We need to identify the root cause, the thing which, if fixed, will eliminate the problem.
We suggest that the root cause of the problem is the instability of foreign-key references when
those foreign keys point to versioned tables.

With "normal" (non-versioned) tables, cascade updates are rare because key changes are rare. If
we were dealing with a Client and a Policy table - instead of with a Client Version and Policy
Version table - then nothing we have considered so far would require a cascade update. But
because we are dealing with foreign key-related version tables, every new version in the
referenced table (Client Version) requires a cascade update in the referencing table (Policy
Version). Because we have chosen to create a new version in the referenced table every time a
new version is added in the referencing table (by incrementing a count), we loop back and create
an infinite cascade update problem.

We can fix the infinite loop problem quite easily. In the case of the example we are working with,
the infinite loop will be broken if a versioned update in the referencing Policy Version table does
not cause a new version to be created in the referenced Client Version table. That means that we
either don't keep a count of policy versions in the client, or else that updating that count is an

5 of 7 10/7/2008 1:02 PM
Time and Time Again: Managing Time in Relational Databases, Part 10 - ... http://www.dmreview.com/dmdirect/20070907/1092319-1.html?type=pri...

overwrite type of update, not one that requires a new version to be created.

Additional discussion of the infinite loop problem, and the fix for it, might provide additional insight
into the root cause of the overall cascade update problem. But we prefer to go after that additional
insight by turning to the other part of the problem, the cascade update from "parent" to "child,"
from the referenced table to the referencing table.

A cascade update happens when a primary key changes, and there exist one or more foreign keys
which reference it. So to fix a cascade update problem (not ignore it and not avoid it), we must
either a) eliminate primary keys, b) eliminate foreign keys or c) stabilize the link between them.

We can rule out the first two options immediately. They amount to conceding that the relational
model cannot handle versions and, more generally, cannot handle time. While it might be
interesting to pursue this concession, we have neither the time nor the inclination to do so here.

But what does the third fix mean? What does it mean to "stabilize" the link between a primary and
a foreign key? Let's start by looking at an example of these keys.

Figure 3: The Client Version and Policy Version Tables

In both tables, the primary key consists of two columns. The first column is an identifier for the
individual object, either a client or a policy. The second column, a date, distinguishes and
sequences the versions of each object. No two versions of the same object can have the same
begin date.

Note: in all examples up to this point, the foreign key in the Policy Version table consists of only a
single column - client number. Now that we are considering an issue involving foreign keys to
versioned tables, we have to drop that simplification and show the full foreign key.

What Does It Mean?

We said earlier that to understand how avoiding the problem might work, we need to answer to the
question, "What do foreign keys relate when they relate versioned tables?" The same thing is true
for the "fix it" option. To understand how to fix the problem, we need to answer the same question.

Next time, in Part 11, we will see that the "avoid it" and "fix it" options are based on two different
answers to the question, "What do foreign keys relate when they relate versioned tables?" Since
different answers are being given, the different responses based on those answers will very likely
entail different semantics supported by the answer. It is a cost/benefit analysis of these different
semantics which will determine whether we should continue using an "avoid it" or a "fix it"
response.

6 of 7 10/7/2008 1:02 PM
Time and Time Again: Managing Time in Relational Databases, Part 10 - ... http://www.dmreview.com/dmdirect/20070907/1092319-1.html?type=pri...

Editor's note: Watch for Part 11 in the October 5, 2007 issue of DM Direct.

For more information on related topics, visit the following channels:

Database Application Performance


Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

7 of 7 10/7/2008 1:02 PM
Time and Time Again: Managing Time in Relational Databases, Part 11 - ... http://www.dmreview.com/dmdirect/20071005/10000163-1.html?type=pr...

Time and Time Again: Managing Time in


Relational Databases, Part 11 - Foreign Keys
in Version Tables Continued
Tom Johnston, Randall Weis
DM Direct, October 2007

In Part 10, we presented a concern that many modelers and database administrators (DBAs) have
expressed about what happens to foreign keys when a new version of an object is created. We
said that "in general, the concern is that if a new version of an object is inserted into a version
table, then all foreign key references to the version it replaced must be updated to point to the
new version. Furthermore, if those updates take place in tables which are themselves versioned
tables, then those updates create additional new versions and all foreign keys that point to them
must also be updated. And so a single versioned update can cascade out to affect many other
versioned tables, ending only when a cascade chain reaches either a) a non-versioned table, or b)
a versioned table which does not itself contain any foreign keys."

We also noted that there is a potential cascade problem when an object is deleted. Note two key
phrases here: a) "when a version is replaced," and b) "when an object is deleted." Versions are
not updated; versions are replaced. It is objects which are updated in the process of one version
replacing a previous one. Also, versions are not deleted; objects are deleted by placing a deletion
indicator on the current version of the object, and also on any future versions on the database at
the time of the deletion.

Integrating Data: An Open Source Approach


Any technology deployed in the information system needs to
interoperate with existing applications or databases. Learn through
real-life scenarios how Open Source data integration solves the
interoperability challenge. Free White Paper.

Chart of Version Patterns

Before continuing, we insert here the current chart of version patterns. References back to earlier
patterns are happening pretty frequently now, and we hope this will help the reader recall what
those prior version patterns are.

1 of 7 10/7/2008 1:03 PM
Time and Time Again: Managing Time in Relational Databases, Part 11 - ... http://www.dmreview.com/dmdirect/20071005/10000163-1.html?type=pr...

Figure 1: Chart of Version Patterns

Chart of This Series

Before continuing, we insert here a chart listing the installments to date in this series, and a brief
description of what each one is about. As this series grows longer, the back references to earlier
installments will only increase, and we think that having a chart like this at hand will be helpful.

2 of 7 10/7/2008 1:03 PM
Time and Time Again: Managing Time in Relational Databases, Part 11 - ... http://www.dmreview.com/dmdirect/20071005/10000163-1.html?type=pr...

Figure 2: Chart of Installments to Date in this Series

What Do Foreign Keys Relate? The Issue

In non-versioned tables, foreign keys relate objects. Each row in a Client table represents a client.
Each row in a Policy table represents a policy. What a foreign key from a Policy table back to a
Client table does can be expressed in two different ways. First, we can say that it relates the row
representing that policy, which contains the foreign key, to the row representing the client for that
policy. Secondly, we can equally well say that it relates that policy to that client.

The first way of saying what the foreign key does refers to the data; the second way refers to the
things which that data represents. The first way, from the point of view of data management, is
physical; the second way is logical or, more precisely, semantic.

In non-versioned tables, there is a one-to-one pairing of things we are interested in (clients,


policies) with data that represents them (rows in tables). Because the mapping between what is
represented and its representation is one-to-one, we don't need to pay much attention to the
distinction. We can talk about what foreign keys relate in terms of either objects or rows.

Seldom dealing with versioned tables, modelers and DBAs are usually not very concerned with the
difference between data and semantics, between tables, rows and columns, and the types,
instances and properties that they represent. But when we are dealing with versioned tables, the
one-to-one mapping breaks down. It is only when there is exactly one row, representing the initial
version of an object, that the mapping appears to be one to one. After that, it is one object to many
versions of that object.

3 of 7 10/7/2008 1:03 PM
Time and Time Again: Managing Time in Relational Databases, Part 11 - ... http://www.dmreview.com/dmdirect/20071005/10000163-1.html?type=pr...

However, a more careful reading of the situation would make it clear that, even in this case of a
single version of an object, there is not a one-to-one mapping between object and row. For a row
which represents a version of an object, even when there is only one version of that object being
represented, does not represent the object. It represents a delimited section of the timeline of that
object. Each version explicitly says, "I represent this object as it was, beginning on this specific
date." Then (in the "ultimate" version pattern we are leading up to), the version also says either
"And I no longer represent this object after this specific end date", or else it says "I will continue to
represent this object 'until further notice'" (which, as we shall explain later, is not the same thing
as saying "I represent this object until the end of time").

Because of this breakdown of a one-to-one mapping between an object in the world and a row in a
table, talk about relating objects and talk about relating rows are not, as they are in the case of
non-versioned tables, effectively equivalent. Let's consider a more typical situation, say, one client
and five versions for that object in the Client Version table, and one policy related to that client,
and three versions for that object, in the Policy Version table. Here, it is clear that the one-to-one
mapping has broken down, both for clients and for policies. And here we will have to make a
choice between the two ways of speaking. Do foreign keys relate policies to clients, or do they
relate policy versions to client versions?

What Do Foreign Keys Relate? One Answer

Those modelers and DBAs who are concerned about the foreign key problem in versioned tables
assume that foreign keys relate rows which represent versions. A foreign key from a Policy Version
table back to a Client Version table relates one row to another, a policy version to its associated
client version.

If we want a justification for the assumption that foreign keys in versioned tables relate versions to
versions, it might go like this: we understand that the business is interested in clients and policies,
and which go with which. But we modelers and DBAs serve the business by managing data. If the
business analyst has gotten the requirements right, and if we have expressed them correctly, then
foreign keys relating rows to rows will accurately reflect the real-world relationships between
objects and objects.

An even shorter justification goes like this: the DBMS doesn't know anything about what's "out
there," about what the data it manages represents. So there is no way we can ask the DBMS to
relate objects; all it knows how to do is to relate rows to rows, and foreign keys are the method it
uses to do that. Versioned tables or non-versioned tables, it's all the same to the DBMS. Foreign
keys relate rows to rows.

But if we proceed on the assumption that foreign keys in versioned tables relate rows to rows, and
thus versions to versions, we can easily find ourselves dealing with the cascade update problem
described in Part 10. Specifically, whenever a new version is created for a client, then all the
policies whose foreign keys pointed back to the original client version must be updated to point to
the new client version. If this update is a versionable update, it will create a new version of the
policy. Then, any versions anywhere else across the database that pointed to the policy which just
received a new version will have to be updated to point to that new version. And so on, and so on.

In Part 10, we concluded that we could not ignore this problem. The reason we can't ignore the
problem is that we are looking for an enterprise solution, one that will apply to all sets of version
tables. It may be that in databases that contain only a few versioned tables, and/or databases
whose versioned tables change very infrequently, the performance penalty of cascading updates

4 of 7 10/7/2008 1:03 PM
Time and Time Again: Managing Time in Relational Databases, Part 11 - ... http://www.dmreview.com/dmdirect/20071005/10000163-1.html?type=pr...

might be manageable. But with an enterprise solution, we cannot count on that. With an
enterprise solution, some databases may be crippled by the performance demands of frequently
updated version tables with lengthy cascade chains.

So, on the assumption that we are relating versions to versions, we must somehow either fix the
problem or avoid it. How can we do that? As we said in Part 10, we could avoid the problem by
turning off RI. But DBAs just don't like to turn off RI. That opens the door to foreign keys that don't
point to anything. So, is there a way to fix the problem rather than avoid it?

There are two ways to fix the problem. One way is to make foreign key updates non-versionable
updates, i.e., updates in place, updates which overwrite the foreign key but which do not produce
new versions in the process. The second way is to not update foreign keys among versions. In
either case, RI among versions can remain on, but cascade updates are prevented.

Let's consider each solution in turn.

Solution 1. Make Foreign Key Updates Non-Versionable.

Figure 1 below shows the example we introduced in Part 10. Both the Client and Policy tables are
versioned tables. However, the inclusion of policy-type as a column of the Client Version table,
back in Part 10, was a mistake. But it turns out to be one we will need as this analysis continues.
So to distinguish it from the policy-type column on the Policy Version table, let's assume that
policy-type on client versions means something like "the type of policy which the client's sales
representative most recently tried to sell to the client." So let's rename it "upsell-policy-type."

Figure 1: The Client Version and Policy Version Tables

Let's assume that a new version of client C882 has just been created. This is shown in Figure 2
below. We must now update all foreign keys that pointed to [C882, 5/23/04], and make them point
to [C882, 10/01/04]. And if we make this a non-versionable change, then there is no cascade
problem. Instead, the result is shown in Figure 3.

5 of 7 10/7/2008 1:03 PM
Time and Time Again: Managing Time in Relational Databases, Part 11 - ... http://www.dmreview.com/dmdirect/20071005/10000163-1.html?type=pr...

Figure 2: Client C882 is Versioned

There is no cascade problem because, as Figure 3 shows, the foreign key change in the Policy
Version table was made as a physical update to C882's current policy version. No new version of
that policy was created.

Figure 3: Client C882's Policies are Updated

This is one way that we can keep foreign keys pointing to the current versions of the objects they
reference, and at the same time avoid the cascade update problem. But notice the drawback to
this solution: it loses information. The specific information that is lost is that, from 6/01/04 to
10/01/04, policy version [P138, 6/01/04] pointed to client version [C882, 5/23/04]. During that
period of time, a join from that policy version to its related client version would have shown an
upsell policy type of PPO. This is the information that is lost.

Why is it lost? Because we physically updated the foreign key in the policy version. We did that in
order to avoid the cascade update problem. But why does the foreign key need to be updated?

6 of 7 10/7/2008 1:03 PM
Time and Time Again: Managing Time in Relational Databases, Part 11 - ... http://www.dmreview.com/dmdirect/20071005/10000163-1.html?type=pr...

The answer we get to this question, from the modelers and DBAs we have spoken to, is that joins
should always join to the most current version of the object referenced. After all, with
non-versioned tables, foreign keys point to the most current version – which also happens to be
the only version.

In Part 12, we will continue this line of questioning, which will lead us to consider the second
solution to the cascade update problem, which is: don't update foreign keys that point to versions.
We will then conclude our discussion of foreign keys and versioned tables by considering the
other answer to the question, "What do foreign keys relate?” that answer being, "Objects, not
versions."

For more information on related topics, visit the following channels:

Data Management
Data Modeling

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

7 of 7 10/7/2008 1:03 PM
Time and Time Again: Managing Time in Relational Databases, Part 12: ... http://www.dmreview.com/dmdirect/20071019/10000082-1.html?type=pr...

Time and Time Again: Managing Time in


Relational Databases, Part 12: Foreign Keys
in Version Tables
Tom Johnston, Randall Weis
DM Direct, October 2007

In this article, and in Part 10 and Part 11, we have been discussing the issue of foreign keys and
referential integrity in versioned tables. The issue arises because most modelers and database
administrators (DBAs) seem to believe that if a change is made in a versioned table, and there are
foreign keys pointing to that table, those foreign keys must also be changed. If they themselves
are in a versioned table, and foreign keys in yet other versioned tables point to the rows they are
in, versioned updates will ripple out to those other tables, creating a cascade update problem.

Join the Largest Community of MDM Experts in New York City,


October 19-21, 2008
Restructured to better suit your business needs, MDM Summit Fall 2008
offers value of experience and thought leadership. Speakers will discuss
hierarchy management, identity resolution and more. Pre-register by
August 8 for bonus savings!

In Part 10, we decided that our options are the options we always have for any problem. We can
ignore the problem, avoid the problem or fix the problem. We also determined that we cannot
ignore the problem because we are looking for an enterprise solution, one applicable to all
databases. For databases containing small and/or infrequently updated versioned tables, we
might be able to ignore the problem. But an enterprise solution must apply to all databases,
regardless of size or frequency of update.

In Part 11, we suggested that the most perspicuous line of investigation was to begin by asking
"What do foreign keys to versioned tables relate?" One answer is that they relate rows to rows.
The "related-to" rows are the rows in versioned tables. The "related-from" rows may be either
versions or non-versions. On the assumption that foreign keys relate rows to rows, whether or not
any of those rows are in versioned tables, we developed one solution to the versioned foreign key
problem. Here in Part 12, we develop a second solution to the versioned foreign key problem, also
based on the assumption that those foreign keys still relate rows to rows. We will then develop a
solution based on the assumption that foreign keys that point into versioned tables do not relate
rows to rows, i.e., versions to versions, but rather relate objects to objects.

We will conclude our discussion of referential integrity involving versioned tables by distinguishing
between the machinery of versioning and its semantics. The machinery is complex, and should be
as invisible as possible to developers and users. But the semantics are the semantics of temporal
databases, and once such databases are supported, those who write queries must understand
those additional semantics in order to write unambiguous queries against those databases.

Chart of Version Patterns

Before continuing, we insert here the current chart of version patterns. References back to earlier

1 of 8 10/7/2008 1:04 PM
Time and Time Again: Managing Time in Relational Databases, Part 12: ... http://www.dmreview.com/dmdirect/20071019/10000082-1.html?type=pr...

patterns are happening pretty frequently now, and we hope this will help the reader recall what
those prior version patterns are. They also give a hint of version patterns not yet discussed.

Figure 1: Chart of Version Patterns

Chart of This Series

Before continuing, we insert here a chart listing the installments to date in this series, and a brief
description of what each one is about. As this series grows longer, the back references to earlier
installments will only increase, and we think that having a chart like this at hand will be helpful.

2 of 8 10/7/2008 1:04 PM
Time and Time Again: Managing Time in Relational Databases, Part 12: ... http://www.dmreview.com/dmdirect/20071019/10000082-1.html?type=pr...

Figure 2: Chart of Installments to Date in this Series

Solution 2. Don't Update Foreign Keys to Versions

As explained in Part 10, this is the second solution to the cascade problem, given the assumption
that foreign keys to versioned tables relate versions to versions, if the foreign-key containing table
is also a versioned table, or else relate objects to versions, if the foreign-key containing table is not
a versioned table. The first solution is to update foreign keys in versioned tables, but to do so as
updates in place, updates which overwrite the foreign key but which do not produce new versions
in the process. The second way, discussed now, is to not update foreign keys that point to
versions.

Figure 3 shows the example we introduced in Parts 10 and 11. Both the Client and Policy tables
are versioned tables.

Figure 3: The Client Version and Policy Version Tables

3 of 8 10/7/2008 1:04 PM
Time and Time Again: Managing Time in Relational Databases, Part 12: ... http://www.dmreview.com/dmdirect/20071019/10000082-1.html?type=pr...

Let's assume that a new version of client C882 has just been created. This is shown in Figure 4.
With this “don’t update” solution, no further physical activity against the database is required. In
particular, we do not update the foreign key that points to [C882, 5/23/04].

Figure 4: Client C882 is Versioned

Now let's insert a new policy and see what happens. The result of this insert is shown in Figure 5.

Figure 5: Client C882 Gets Another Policy

At first sight, Figure 5 is puzzling. Policy P138 points to the first version of client C882, but policy
P704 points to the second version. Why is there this difference?

The difference is in the appearance of things only. It is, in fact, the result applying the following
rule.

Figure 6: Rule for Managing Foreign Keys in Versioned Tables

The version of the client that was current when policy P138 was created is the first version. The

4 of 8 10/7/2008 1:04 PM
Time and Time Again: Managing Time in Relational Databases, Part 12: ... http://www.dmreview.com/dmdirect/20071019/10000082-1.html?type=pr...

version current when policy P704 was created is the second version. So with this rule, every
foreign key pointing to a versioned table points to the version of the related object that was current
when the foreign-key containing row was created. This rule applies both when the foreign key is a
column in a versioned table, and also when it is a column in a nonversioned table.

Joins using such versioned foreign keys show what things look like, across joins, at the moment in
time that the row from which we are joining was created. Therefore, using this rule, we can always
follow a chain of foreign keys out from a particular row of interest, and bring together versions of
the related objects as they were when that original row was created. We can, in other words, bring
back data which is equivalent to a snapshot of that data taken at the moment the original row was
created.

But sometimes that isn’t what we want. Sometimes, we want to join to the most current version of
the related object. For example, consider a query which takes policy version [P138, 6/1/04], and
joins it to the client who owns that policy. If we use the foreign key in that version, it will join to
client version [C882, 5/23/04].

If this query was executed anytime between 5/23/04 and 9/30/04, it returned an upsell policy type
of PPO. And during that period of time, this is the correct upsell policy type. But suppose this
same query was executed on or after 10/1/04. In that case, the query returned an incorrect result,
because from 10/1/04 forward, the current upsell policy type was HMO.

What are we to make of this? First of all, it is clear that this way of managing foreign keys in
versioned tables will always produce correct results provided that, implicit in every join is the
assumption that joined-to versions will be the version current when the joined-from row was
created. But if this assumption is not valid, and in particular when what is desired is joins that
return the then current versions of all joined-to objects, then using the foreign keys in versioned
tables to make those joins is not guaranteed to return correct results.

How, then, can we write these other queries, the ones that want the current version of all joined-to
versioned objects? The way we can do this is to write a query which specifies the joined-to object
identifier – client number in this case – together with a clause that specifies the greatest version
begin date not greater than the current date, and containing a {null} object end date. The “not
greater than the current date” qualifier is needed to filter out future versions, versions that are not
yet in effect at the time the query is executed. The "containing a {null} object end date" qualifier is
needed to filter out an otherwise qualifying version that is for an object that has been deleted.

Something else that we must be aware of is that even though foreign keys to versioned objects
point to rows current at the time the containing row is created, this does not guarantee that using
these foreign keys will always return versions current when the joined-from row was first created.

Here's how this apparent anomaly can happen. Suppose that we also have an Employer table,
which is also versioned. This table contains the employers of our clients. Now suppose we have a
query which selects a policy, and then joins first to the client version current when it was created,
and from there to the employer version current when that client version was created. This query
may return employer data that was not current when that policy version was created. Figure 7
illustrates this situation.

5 of 8 10/7/2008 1:04 PM
Time and Time Again: Managing Time in Relational Databases, Part 12: ... http://www.dmreview.com/dmdirect/20071019/10000082-1.html?type=pr...

Figure 7: “Current When” Joins Do Not Always Retrieve Contemporary Versions

This join will return [E721, 11/15/01], even though the version of employer E721 that was current
on 6/1/04 was [E721, 5/15/04]. Supposing that the third column is employer data retrieved in the
query, that query will show “abc” as the "current" employer value, whereas “xyz” was the value
current when the policy version was created. These situations can arise because “current when” is
not a transitive relationship. This is illustrated in Figure 8.

Figure 8: “Current When” is not a Transitive Relationship

Assumption 2. Foreign Keys in Versioned Tables Relate Objects to Objects

The first answer to the question, "What do foreign keys in versioned tables relate?” was that they
relate rows to rows. On that assumption, we found that there were two solutions to the cascade
update problem.

The second answer to that question is, “Foreign keys in versioned tables relate objects, not
versions.” But we have to understand what this means since, in a physical database sense, it is
obvious that foreign keys relate rows to rows. And in the case of our examples, since both the
Policy and Client tables are versioned tables, foreign keys relating these two tables must relate
rows to rows – in other words, versions to versions.

But in a semantic sense, our ability to relate versions to versions, given current RDBMS
technology, is limited. If we don’t update foreign keys when the versions they point to are

6 of 8 10/7/2008 1:04 PM
Time and Time Again: Managing Time in Relational Databases, Part 12: ... http://www.dmreview.com/dmdirect/20071019/10000082-1.html?type=pr...

superseded, then the relationships implemented by those foreign keys slowly “age.” On the other
hand, if we do update (overwrite) foreign keys when the versions they point to are superseded,
then the relationships established when those rows were first created are no longer stored in the
database. Our only other option is to create a new version each time a foreign key changes. But
that option creates the cascade update problem we are trying to avoid.

On assumption 2, we may as well store only the object identifier component of a foreign key to a
versioned table. In our example, we may as well store only client number, and not bother with
version begin date. If we want to join a policy to the client version current when the policy was
created, the WHERE clause will ask for the version of that client whose begin date is the greatest
date less then or equal to the begin date of the policy version, and whose object end date is {null}.

On the other hand, if we want to join a policy version to the most current version of the related
client, the WHERE clause will ask for the version of that client whose begin date is the greatest
date less then or equal to [now], and whose object end date is {null}.

One would hope, of course, that a versioned client would not be version-deleted (have a non-null
object end date) if it still had at least one non-deleted versioned object related to it. If we were
dealing with non-versioned objects, the DBMS would guarantee this by either carrying out a
cascade delete, or by denying the delete until all dependent objects were first deleted.

But with versioned objects, the DBMS cannot guarantee this because such temporal semantics
are not part of any current RDBMS. So since we must guarantee it ourselves, there is always a
chance of error. To guard against that error, the WHERE clause we are discussing must also
specify that the client version whose begin date is the greatest date less then or equal to [now]
also has an object end-date of {null}.

Hiding the Complex Semantics of Time

We have already mentioned that managing time in relational databases is perhaps the most
difficult challenge we have ever encountered as analysts, modelers and DBAs. This discussion of
the semantics of versioned foreign keys certainly emphasizes this point.

It is clear that the mechanisms which manage time in relational databases must be hidden from
developers and users alike. One reason is the complexity of those mechanisms. Another reason is
that the management of time is not an applications- or subject area-specific requirement any more
than the management of dynamic balanced trees is an application- or subject area-specific
requirement.

The optimal solution will be one provided by vendors and specified by standards committees. The
mechanics may differ from one vendor to another, as long as the syntax and semantics of both the
SQL DML and SQL DDL are supported. But a SQL with temporal data management capabilities
must first be proposed and then accepted by the standards committees. The proposal was made,
almost a decade ago, but the acceptance seems no closer today than it was back then.

So in the meantime, we must provide our own encapsulation of this complexity, our own
enterprise-level solution to temporal database management. That is what we are doing in this
series of articles.

Being Aware of the Complex Semantics of Time

7 of 8 10/7/2008 1:04 PM
Time and Time Again: Managing Time in Relational Databases, Part 12: ... http://www.dmreview.com/dmdirect/20071019/10000082-1.html?type=pr...

There is one thing about temporal database management that cannot be hidden, however. It is the
fact that with such databases, new temporal semantics are available. This means that anyone who
writes queries must understand these semantics, and code his queries accordingly. For example,
a query asking for a join of a policy version to a client version raises the question, “Which client
version?” A query asking for a join of a policy to a client raises no such questions.

If the query writer does not understand the possibilities, and therefore does not code for them,
then either the DBMS will tell him that his query is ambiguous, or it will execute the query using
whatever default behavior has been defined. In the former case, the query writer needs to
understand what the error message means, and then write a new query that makes his temporal
assumptions explicit. In the latter case, the query writer had better understand what the issues
are, and what the default behavior is, or else he may ask one question with his query, but receive
an answer to a different question in the result set returned by the DBMS.

The basic point is this. The mechanics of temporal database management can and should be as
invisible to developers and other query authors as possible. But it is still necessary for any query
author to understand temporal semantics, because he cannot unambiguously say what he means
unless he does understand those semantics. No matter how elegant and expressive temporal SQL
will prove to be, it cannot relieve query authors from the obligation to understand the options
involved, and to be careful to code queries which say exactly what they mean.

The Versioned Table Foreign Key Position, Going Forward

In Part 13, we will take up Version Pattern 6, the bi-temporal versioning of recurring objects. When
we do, we will handle foreign keys to versioned tables by capturing the full foreign key of the
version current as of the time the row in question is inserted, both object identifier and version
begin date. This will be done whether the row being inserted is being inserted into a versioned or a
non-versioned table. The issue is the table pointed to, and what to do when that table is
versioned. Thus, in terms of our example, we will continue to record both client number (the object
identifier) and version begin date as a foreign key in the Policy table.

For more information on related topics, visit the following channels:

Data Modeling
Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

8 of 8 10/7/2008 1:04 PM
Time and Time Again: Managing Time in Relational Databases, Part 13 -... http://www.dmreview.com/dmdirect/2007_45/10000273-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 13 - A Roadmap of
the Rest of This Series
Tom Johnston, Randall Weis
DM Direct, November 2007

We are nearly ready to discuss the penultimate pattern, Version Pattern 6. This pattern expresses
the combined semantics of all the version patterns we have discussed so far. The last versioning
requirement we will examine adds the requirement to version error corrections, and it is Version
Pattern 7 that will accommodate that requirement, as well as all the patterns previous to it. Version
Pattern 7 is therefore the “ultimate” versioning pattern to be discussed in these articles.

It is clear that if Pattern 6 can express the semantics of patterns 4 and 5, it can express the
semantics of all prior patterns. And indeed it can.

Intuitive Data Reporting & Dashboards


Get instant visibility into your business with clickable access to your
data no matter where it resides. Customize reports & dashboards on
the fly.Try PivotLink now.

Like Pattern 4, it distinguishes business effectivity dates from database activity dates, making it
possible, for example, to create new versions in advance of the date on which they will become
effective.

Like Pattern 5, it manages versions which may have temporal gaps between them, making it
possible, for example, to record the reappearance of an object after its absence for some period of
time.

But before we discuss Pattern 6, there are two preliminary matters to address. The first is to
provide a roadmap to the rest of the articles in this series. The second is to introduce the use of a
special set of surrogate keys for versioned tables, and explain why they are needed and how they
will be used.

The rest of this article lays out that roadmap. Then, Parts 14 & 15 make the case for using
surrogate keys, and describes the special considerations needed to make surrogate keys work
with versioned tables.

Chart of Version Patterns

Before continuing, we insert here the current chart of version patterns. References back to earlier
patterns are happening pretty frequently now, and we hope this will help the reader recall what
those prior version patterns are. They also give a hint of version patterns not yet discussed.

1 of 9 10/7/2008 1:05 PM
Time and Time Again: Managing Time in Relational Databases, Part 13 -... http://www.dmreview.com/dmdirect/2007_45/10000273-1.html?type=pri...

Figure 1: Chart of Version Patterns

Chart of This Series

Before continuing, we insert here a chart listing the installments to date in this series, and a brief
description of what each one is about. As this series grows longer, the back references to earlier
installments will only increase, and we think that having a chart like this at hand will be helpful.

2 of 9 10/7/2008 1:05 PM
Time and Time Again: Managing Time in Relational Databases, Part 13 -... http://www.dmreview.com/dmdirect/2007_45/10000273-1.html?type=pri...

Figure 2: Chart of Installments to Date in this Series

Roadmap for the Remainder of the Series

The high-level view of the roadmap to the rest of this series is as follows. Items are listed in the
sequence in which they will be discussed.

1. Versioning with surrogate keys.


2. The physical implementation of Version Patterns 1 – 5.
3. Version Pattern 6. Semantics and implementation.
4. Managing recurring objects.
5. Version Pattern 7. Semantics and implementation.
6. Star Schema versioning.

3 of 9 10/7/2008 1:05 PM
Time and Time Again: Managing Time in Relational Databases, Part 13 -... http://www.dmreview.com/dmdirect/2007_45/10000273-1.html?type=pri...

7. Snapshots.
8. Sandbox versioning.
9. Hierarchical versioning.
10. Performance considerations.
11. Migrating to an enterprise versioning solution.
12. Notes towards a standards-based and vendor-implemented solution.

Let’s briefly consider each of these items.

Versioning Using Surrogate Keys

Surrogate keys are primary keys whose values are system-created and free of any business
content. Although Dr. E. F. Codd’s Information Principle recommends the use of business data
rather than surrogate keys as primary keys, there are both theoretical and practical reasons for
using surrogates as primary keys.

The distinction we have already made, between objects and their versions, is best represented,
when surrogate keys are used, by a pair of such keys. One of them we will call an “object
identifier” (“object id,” abbreviated as OID), and the other a “version identifier” (“version id,”
abbreviated as VID). These matters will be the focus of the Roadmap Item #1 discussion.

The Physical Implementation of Version Patterns 1 – 5

In earlier articles, we noted that each of the first five version patterns might be an appropriate
solution in a specific versioning situation. Those articles discussed the semantics which each
pattern supports, and described each pattern in what we will call logical data model (LDM) terms.
With Roadmap Item #2, we will provide the SQL DDL and DML that define the corresponding
physical database schemas, and that provide access to them. We will call this the physical data
model (PDM) representation of these patterns.

This clears the deck by completing our discussion of the first five patterns. It's time, then, to move
on to more advanced patterns.

Version Pattern 6: Semantics and Implementation

In semantic terms, this pattern is the logical union of the semantics of all preceding patterns. After
demonstrating that this is so, we go on to a PDM discussion, which demonstrates how to actually
build tables that conform to this pattern. Also, with this pattern, we switch from business keys to
surrogate keys. This will be the focus of the Roadmap Item #3 discussion.

Managing Recurring Objects

Recurring objects are anything that persists over time, are recognized by us, go away, and later
on come back. A policy that has lapsed, but then is renewed, is a recurring object.

But the most important kind of recurring objects, for businesses, are customers – people or
organizations who purchase goods or services from us. The principal reason that recognizing
returning customers is so important is human psychology. Returning customers want to be
greeted like old friends. They don't want the standard "We're happy to welcome you as a
customer of ours" greeting. They want a "Welcome! It's great to have you back!" greeting.

4 of 9 10/7/2008 1:05 PM
Time and Time Again: Managing Time in Relational Databases, Part 13 -... http://www.dmreview.com/dmdirect/2007_45/10000273-1.html?type=pri...

Another reason it is important to recognize returning customers is so we can use the valuable
knowledge we have gained about them from their previous history with us, particularly the
knowledge of what kinds of products or services they tend to purchase from us. If we're a
company really on its toes, we may even have captured the information about why they left us in
the first place. All this information is far too valuable to lose just because the IT Department tells
upper management that "the match problem is difficult"!

From the point of view of temporal database management, however, it doesn't matter whether the
returning/recurring object is a person or a policy, a customer or a competitor. The point is that
when a new row is inserted into a versioned table, it is inserted as the next version of an object for
which we already have prior versions, even though there is a gap in time between when the object
was terminated and when this new version is inserted. These gaps mean that we can recognize
logical groupings of versions, one group for each continuous set of versions, with gaps
distinguishing the groups. These groups are what we called "episodes" back in Part 5.

We already know how to record the reappearance of a returning object in a versioned table. But
there are two related issues, one of which we will discuss, and one of which we will not. The issue
we will not discuss is how to recognize a returning object as an object we have seen (and
recorded in our databases) previously. In a non-temporal form, this is identical to the issue of how
to recognize that data from two different input streams, very often with different and incomplete
identifying data, are indeed for the same object. The issue, in either its temporal or its
non-temporal form, is "match logic,” and it differs so much from business to business, and object
type to object type, that nothing useful could be said of the general match logic problem in much
less than a book-length treatment.

The issue we will discuss is what to do when we handle reappearance incorrectly. The more
general issue is what to do when our match logic makes the wrong choice. The first wrong choice
is to say that we have a recurring object (a match to an object already in our databases) when in
fact we do not. The second wrong choice is to say that we do not have a recurring object when in
fact we do. The third wrong choice is to say (correctly) that we have a match, but to identify the
wrong object to match it with.

In the first case, once we identify our mistake, we have to split the incorrect match, separating data
identified as being for one object into two sets of data, one for each of the objects we now
recognize as distinct. In the second case, once we identify our mistake, we have to merge the data
for the two objects into data for a single object, either letting one of the two objects "survive" the
merge, or else creating a new third object to merge the first two into. In the third case, we have to
do both.

This is not a versioning problem, but it is a problem that inevitably arises when we are required to
(try to) correctly recognize recurring objects. When that requirement is combined with the
requirement to version the objects in question, we get a situation in which error correction
becomes a very complex problem. So the relevance of this issue to versioning is that if both
recurring objects and error correction are requirements, the problem of correcting match logic
mistakes becomes in part a versioning problem.

This complex problem is the subject matter of Roadmap Item #4. But most of the analysis of this
problem is relevant whether or not the target tables being affected are versioned or not. So
because the requirement to recognize recurring objects is being imposed on more and more
business application systems, and because the work required to correct the inevitable match
mistakes is so complex, we think it is warranted for us to branch off from our core topic of

5 of 9 10/7/2008 1:05 PM
Time and Time Again: Managing Time in Relational Databases, Part 13 -... http://www.dmreview.com/dmdirect/2007_45/10000273-1.html?type=pri...

versioning, and consider the related topic of correcting object matching mistakes.

Version Pattern 7: Semantics and Implementation

With Roadmap Item #5, we present Version Pattern 7, our “ultimate” version pattern. Because of
its importance, we will provide a more thorough and detailed discussion of it than we have of the
prior version patterns.

In particular, we will provide over a dozen usage scenarios for this pattern, whereby “usage
scenario” we mean something like a use case, except that a workflow of separate processing steps
is not part of a usage scenario. Our usage scenarios focus on those events in which one or more
versioned tables are either updated from source transactions, or queried with SQL.

For each usage scenario, we will discuss both the updates required to get the retrieved data into
the database, and the queries required to get that data out. An LDM level discussion will be
followed by the physical schemas, transformation code and SQL which constitute the PDM
discussion. Specifically, we will cover the following points:

Sample tables, in their logical form, as already used with the earlier patterns. In addition, sample
physical tables, and the DDL used to generate them. All sample tables will contain instance data
sufficient to illustrate the scenario being presented.

Timeline diagrams for each scenario. These are the same diagrams that have been used to
explain earlier version patterns.

The semantics of each update scenario. These semantics will be clearly stated. In addition, the
transforms from source transaction to versioned table updates will be documented, either as code
or as “logically tight English.” These transforms provide one aspect of the encapsulation of
versioning shown in Figure 2 of Part 1.

The semantics of view creation. This will include SQL view creation statements, and their
instance-level results. These views provide the complementary aspect of the encapsulation of
versioning shown in Figure 2 of Part 1.

The semantics of temporal queries against these views. The objective here is to maximize the
simplicity of the query, while keeping all the temporal semantics available for the query author.

Star Schema Versioning

Star schemas are the most common data model for fact/dimension databases, often referred to as
"data marts.” But versioning doesn't apply to fact tables because fact tables are about events, and
events don't change. Events happen, and then they are over with. Versioning tracks change over
time, and only objects which persist over time can change over time. They change by participating
in events. In the Roadmap #6 articles, we will discuss the semantics – indeed the ontology – of
events and persistent objects, in order to understand this important point more clearly.

However, the dimension tables of fact/dimension databases do represent persistent objects.


Indeed, they represent the very persistent objects whose properties or relationships are altered by
their participation in the events recorded in their fact tables. Because dimensions represent
objects that change over time, versioning is applicable to them. The Roadmap #6 articles will
provide both a logical and a physical description of the use of versioning in fact/dimension

6 of 9 10/7/2008 1:05 PM
Time and Time Again: Managing Time in Relational Databases, Part 13 -... http://www.dmreview.com/dmdirect/2007_45/10000273-1.html?type=pri...

databases.

Snapshots

In Part 1, we provided a complete taxonomy of ways in which historical data is maintained in


business databases. In particular, we distinguished snapshots from versioning. Although this
series has focused on versioning, snapshots also have their place in the management of temporal
data. Roadmap Item #7 discusses snapshots and their uses.

Sandbox Versioning

Often, application systems need to accumulate updates over a period time, and across several
physical units of work (UOW) that represent a single logical business transaction. Not until all
updates are applied, and the results of the complete set of updates validated, are those results
made visible to database end users. In other words, the updates are accumulated from several
different UOWs, but are then assembled into a single logically atomic transaction, i.e., one in
which the updates either all succeed or all fail, and in which the state of the database while the
updates are being applied is not visible to end users.

One common way of implementing the requirement to create a single, multi-transaction atomic
update out of updates from several distinct UOWs is to copy the database schema and store the
results of the updates in this copy. Once all updates have been applied, and all results stored in
the copy database, those results are then validated and either approved, or re-applied until the
results are all correct. At that point, all the data is moved from the tables in the database copy to
the corresponding tables in the live database.

This way of implementing a multiple UOW to single atomic update requirement is often costly.
Redundant objects and processes may be needed to validate the data stored in the copy
database. In addition, it may require significant amounts of system resources to marshal this data
into and out of the copy database.

Sandbox versioning will be discussed when we reach Roadmap Item #8. We will explain how the
requirement we have described can be met without the need for a copy database, and also
without the need for intra-database copies of the updated tables for these pending updates.
Instead, everything is done within the original tables themselves. When the results have been
validated and approved, the results become visible to the database user almost instantly,
regardless of how many rows across how many tables have participated in this "mass update."

Sandbox versioning functionality also provides the basis for the “check-out” and “check-in”
capabilities of hierarchical versioning, which is described next.

Hierarchical Versioning

The versioning we have been considering, up to this point, we may call “linear versioning.” Except
for error corrections, versions of the same object follow one after the other, with or without
intervening gaps, but never overlapping. They thus constitute a linear sequence of versions, each
the truth, and the only truth, about the object during the version’s effectivity period of time.

But businesses also have a need for hierarchical versioning. As the name suggests, this is
versioning in which one or more versions are both a) “child” versions of the same “parent version,”
and b) “valid,” in some sense, in concurrent and/or overlapping periods of time. IT professionals

7 of 9 10/7/2008 1:05 PM
Time and Time Again: Managing Time in Relational Databases, Part 13 -... http://www.dmreview.com/dmdirect/2007_45/10000273-1.html?type=pri...

deal with this kind of versioning when they use software that allows them to “check-out” and
“check-in” copies of a base object.

Roadmap Item #9 discusses hierarchical versioning.

Performance Considerations

Other things being equal, a versioned table of an object (i.e., an entity type) will not perform as
well as a non-versioned table of the same object. Several factors account for this, including the
following ones:

Versioning combines current and non-current versions of an object in the same physical table, and
in general, the non-current versions will account for most of the rows and most of the size of the
table.

As explained in Parts 10 – 12, temporal joins are not directly supported in current RDBMSs.

Necessarily, dates are used to carry much of the semantics of temporal distinctions among
versions. In general, this means that to retrieve a version current as of a specific date, the
versioned table must be searched for first the object, and then for the version possessing the
unique pair of effectivity dates that frame the specific date.

Roadmap Item #10 presents some real-world production database metrics for updating and
querying versioned tables. They also include some “tips and tricks” for performance-tuning, based
on several years’ experience we have had in implementing and supporting relational databases
which contain versioned tables.

Migrating to an Enterprise Versioning Solution

Starting from scratch is easy. It’s like bringing a ship into dry dock for major repairs. But in
business IT, we seldom have that luxury. Usually, we are required to carry out major repairs while
the ship is still at sea. The articles in this series, up to Roadmap Item #10, will serve you well if you
are able to design one or more versioned databases from scratch, before they begin to be
populated with production data. Otherwise, your will have to face the challenge of introducing
versioning into production databases.

From a theoretical point of view, this is a special case of the “schema change” problem, one in
which the schema changes are from non-versioned to versioned schemata. (See Bibliography.)
From a practical point of view, the challenge is even more difficult than that; and the reason is that
the schema changes cannot be done all at once, against a database that has been taken off-line,
schema-changed, instance-updated, and then brought back on-line. That would be the dry dock
approach to schema change.

Our challenge, in the world of production databases, is to carry out schema changes while still “at
sea.” The Roadmap Item #11 articles will present our recommendations for implementing schema
changes in already productionalized databases.

Notes Toward a Standards-Based and Vendor-Implemented Solution

We lack a vendor implementation of anything more than elementary functionality for temporal
database management. The reason is that the SQL Standards Committee, and the British

8 of 9 10/7/2008 1:05 PM
Time and Time Again: Managing Time in Relational Databases, Part 13 -... http://www.dmreview.com/dmdirect/2007_45/10000273-1.html?type=pri...

committee in particular, rejected the original temporal database proposal made several years ago
by Dr. Richard Snodgrass.

We are not aware that any progress has been made towards resolution of this issue by the
standards committees. For those interested in what the differences are, each side has written a
book explaining their position. These are the two books referenced in Part 1, one by Dr. Richard
Snodgrass and the other by Chris Date, Hugh Darwen and Nikos Lorentzos.

With Roadmap Item #12, we will attempt to find common ground between the two positions, and
also describe implementation issues which we believe should be taken into consideration when
designing this standard.

Roadmap Wrap-Up

In summary, we anticipate that a sufficiently detailed coverage of these eleven topics will require
thirty to thirty-five more articles. At two articles a month, that means that this series will continue
for approximately another fifteen to eighteen months.

As always, the authors welcome questions and comments from readers. This input can help us
identify issues we have not discussed clearly enough, or objections we have not considered at all,
and we can then clarify the issue, or respond to the objection. As always, we can be reached via
the indicated email addresses, and also at our respective websites.

Note: like many authors in this field, we have begun to use the word "semantics" quite frequently. And, like most of them,
we haven't bothered to define the term before using it. Of course, one way to learn a new term is "by osmosis" as it
were, to simply see it used in enough sentences that one gets a "feel" for how to use it. This, in fact, is how we all learn
nearly all the words we use. Googling is a way to accelerate this kind of learning.

But some words are important enough and specialized enough that we should supplement this
bootstrap approach to learning by studying a definition and the discussion that accompanies it. To
that end, one of us (Johnston) has recently added just such a definition with discussion to his
Web site, MindfulData.com.

Reference:

1. E. Rahm and P. A. Bernstein. (December 2006.) An Online Bibliography on Scheme


Evolution. SIGMOD Record 35(4), 30–31.

For more information on related topics, visit the following channels:

Data Modeling
Databases
DW Basics
DW Design, Methodology

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

9 of 9 10/7/2008 1:05 PM
Time and Time Again: Managing Time in Relational Databases, Part 14 - ... http://www.dmreview.com/dmdirect/2007_47/10000330-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 14 - Versioning
with Surrogate Keys
Randall Weis, Tom Johnston
DM Direct, November 2007

Beginning with Version Pattern 6, we will use surrogate version keys in all versioned tables. But
note that the phrase “surrogate version keys” should be read as “surrogate keys for versions,” and
not as “keys for surrogate versions.” We are recommending the use of surrogate keys for
versioned tables. We are not introducing a concept of “surrogate versions.”

Neither the chart of version patterns nor the chart of this series are included in this article. The
reason is that this discussion of surrogate keys is relatively self-contained and makes few
references to earlier articles in this series.

Join the Largest Community of MDM Experts in New York City,


October 19-21, 2008
Restructured to better suit your business needs, MDM Summit Fall 2008
offers value of experience and thought leadership. Speakers will discuss
hierarchy management, identity resolution and more. Pre-register by
August 8 for bonus savings!

We begin by defining our terms.

Surrogate and Other Kinds of Keys: Definitions

Primary Key. The column or columns of a table in a relational database that the database
management system (DBMS) recognizes as a unique identifier for that table, and as the column or
columns whose values will be duplicated in foreign keys in order to create relationships.

Comment: The second half of the definition, the part following the "and," is necessary. The reason
is that a DBMS is able to recognize and accommodate more than one unique identifier for a table.
Any column or columns over which a unique index is defined is as much a unique identifier for a
table as is the primary key. Once one of those columns or groups of columns is designated the
primary key, the DBMS knows that their values will appear in foreign keys whenever a relationship
to their rows is created.

Foreign Key. The column or columns of a table in a relational database that the DBMS uses to
establish a relationship between that table and another (not necessarily distinct) table. At the
instance level, foreign keys relate pairs of rows such that each foreign key value is identical to the
primary key value of a row in the referenced table.

Surrogate Key. A primary key whose values are system-generated and are either hidden from
business users or exposed with the understanding that no meaning other than their function as
unique identifiers should be assumed true of them.

1 of 6 10/7/2008 1:06 PM
Time and Time Again: Managing Time in Relational Databases, Part 14 - ... http://www.dmreview.com/dmdirect/2007_47/10000330-1.html?type=pri...

Comment: the second half of the definition, the part following the "and," is necessary. The reason
is that system-generated primary key values may be exposed to business users. A typical example
is system-generated document numbers for invoices, purchase orders, receipts, etc. The exposure
does not mean that those primary keys are no longer surrogate keys. But, if exposed, such keys
must be exposed to business users with the clear understanding that their only role is to identify,
and that no other meaning can reliably be given to them.

Because this has turned out to be an excellent example with which to illustrate the concepts of
meaning and semantics, the next section will be devoted to elaborating on it. This is an aside to
our main theme of introducing the use of surrogate keys with versioned tables, but we think it is an
important one.

Business Key. A unique identifier for a table, all of whose columns express business meaning.

Natural Key. A primary key which is a business key.

Semantics and Meaning, Illustrated With a Surrogate Key Example

Speaking of surrogate keys, we said above that "… if exposed, such keys must be exposed to
business users with the clear understanding that their only role is to identify, and that no other
meaning can reliably be given to them."

The word "reliably" is essential here, because sometimes additional meaning is imputed to
surrogate keys by business users, whether we IT folks like it or not. For example, perhaps many
users of Acme's Policy table know that, following the "P" in its values, the three-digit number is a
number that was sequentially assigned by the system. Knowing that, these business users may
then use that knowledge to conclude, for example, that of two clients, the one with the higher
numeric part in the client-nbr was added to the database later than the other one.

In this way, such business users are imputing meaning, over and above the meaning of the key as
a unique identifier, to the values that key takes on. The additional meaning, or semantics, is that
the surrogate key also chronologically sequences the events in which the clients or policies were
added to Acme's tables. Those key values "tell us" the sequence in which rows of that table were
created, and it is in their "telling us that" that their imputed semantics consists.

While IT cannot stop users from reading these semantics into these keys, we can and must
explicitly state that such imputed semantics are not reliable. We must tell the users that if they
continue to do this, they do so at their own risk. Let's see, now, what kind of trouble such users
can get themselves into.

Acme's Policy Management System initially did start out by creating primary key values using a
sequence number for the last three characters in the key. Although many business users knew
this, what they did not know is that policy numbers, starting about a year ago, were being reused.
Whenever an Acme policy was deleted, a six-month waiting period was observed (to allow all
references to the deleted policy to be "flushed out" of the on-line database), after which the policy
number was used as the unique identifier of a different policy. Obviously, the reuse of policy
numbers makes them useless for telling which of two policies was created before the other one.
This illustrates the risks of relying on undocumented semantics.

As semantically risky as such reuse of identifiers is, most of us who have been around for awhile
have seen this kind of thing happen. In our case, Acme apparently thought that it would never

2 of 6 10/7/2008 1:06 PM
Time and Time Again: Managing Time in Relational Databases, Part 14 - ... http://www.dmreview.com/dmdirect/2007_47/10000330-1.html?type=pri...

have to manage more than a thousand policies. A very modest way to start out, but also a little
shortsighted. And this shortsightedness is the cause of this semantically risky reuse of policy
numbers. Acme's IT department is desperately trying to put off the evil day when they will have to
change from four-byte to five-byte policy numbers.

It doesn't take much reflection to realize that the changeover will be very "evil," i.e., very costly and
complicated. Existing policy numbers are everywhere – in primary keys, in foreign keys, across
several Acme databases, and in the databases of customers and partners. Those numbers are "in
flight" on transactions created but yet to be applied to Acme's databases. This can't be cleaned up
in one fell swoop, on one night between midnight and 6a.m. while all the data is unavailable for
access. It can't be cleaned up, in other words, in one multi-transaction atomic unit of work. While
the changeover from a four-byte to a five-byte client number is taking place, semantic disintegrity
will be exposed to business users. Joins will return only partial result sets. Code and SQL will
have to be rewritten to use "link tables" and, after the changeover is complete, rewritten again to
not use those link tables.

It reminds me of what Everett Dirksen, the senior senator from Illinois who almost helped to write
the Civil Rights Act of 1964, said about how the government spends money: "A billion here, a
billion there, and pretty soon you're talking about real money." Well, when it comes to changing
the syntax of the primary keys of critical enterprise tables, it may not be a billion here and a billion
there. But it can often be "a million here and a million there."

The Client Version and Policy Version Tables – Without Surrogate Keys

Figure 1 shows our Client Version and Policy Version tables, as we have seen them through the
first fourteen articles in this series. In these tables, the primary key is a natural key, not a
surrogate key, because the same columns that play the role of primary key also play the role of
business key.

Client-nbr and Policy-nbr. Although we have identified client-nbr and policy-nbr as surrogate keys,
each is really a hybrid. We said that the numeric part was surrogate because, even if it had the
semantics of chronological sequence, semantics were an accident of how the numbers were
created and could not be relied on. But we tacitly accepted the semantics of the alphabetic
portion, in which "C" designated a client key and "P" designated a policy key.

So because one part of those keys has business meaning, let's think of them as natural keys. In
other words, let's think of them as not being surrogate keys. (This notion of a primary key column
having "parts" brings up the old issue of "intelligent keys.” But intelligent keys have no relevance
to the issues being discussed here. However, for an extensive discussion of intelligent keys, see
Johnston's two articles on "Primary Key Re-Engineering Projects"(Part 1, Part 2).

Client-ver-beg-dt and Policy-ver-beg-dt. These primary key columns have business meaning,
because they designate business effective begin dates in both cases. Therefore, [client-nbr +
client-ver-beg-dt] and [policy-nbr + policy-ver-beg-dt] are business keys for versions of clients and
policies, respectively.

3 of 6 10/7/2008 1:06 PM
Time and Time Again: Managing Time in Relational Databases, Part 14 - ... http://www.dmreview.com/dmdirect/2007_47/10000330-1.html?type=pri...

Figure 1: The Client Version and Policy Version Tables with Natural Keys

The Client Version and Policy Version Tables – With Surrogate Keys

Figure 2 shows our examples of a Client Version and Policy Version table, with surrogates used as
primary keys. The first thing to notice is that these tables still contain their original business keys.
Business keys are what provide semantic entity integrity. Primary keys provide physical entity
integrity. The distinction between semantic and physical entity integrity arises, of course, only
when we substitute surrogate for natural keys, and thus separate what the DBMS recognizes as
the primary key from what the business recognizes as the combination of data elements that
properly play the role of distinguishing each instance of the entity from every other instance.

This is a complex topic, and we will not say anything more about it in these articles. However, at
the DAMA 2008 Conference, Johnston's presentation will discuss this issue in greater detail.

Client-nbr, client-ver-beg-dt, policy-nbr and policy-ver-beg-dt have the same semantics they have
had all along, except that their role as primary keys has been removed. In the primary key roles,
we have provided two new columns for each table, an object identifier ("oid") and a version
identifier ("vid").

We have also "stacked" a few more columns, again so the picture of these tables could fit
comfortably on a page. So although the picture of each table suggests that there are six columns
apiece, in reality the Client Version table has nine columns and the Policy Version table has ten
columns. Or, if you prefer, the graphic columns are six and six, respectively, while the database
columns are nine and ten, respectively.

The Client Version and Policy Version tables, with surrogate keys, are shown in Figure 2.

4 of 6 10/7/2008 1:06 PM
Time and Time Again: Managing Time in Relational Databases, Part 14 - ... http://www.dmreview.com/dmdirect/2007_47/10000330-1.html?type=pri...

Figure 2: The Client Version and Policy Version Tables with Surrogate Keys

For both oids and vids, we have used five-digit numbers. But the important thing is that a) these
numbers carry no business meaning whatsoever, and b) they play their roles as unique identifiers.
Their semantics as unique identifiers are, at a minimum:

No two policies may have the same policy-oid.


All versions of the same policy must have the same policy-oid.
No two versions of the same policy may have the same policy-vid.
And the same for clients.

For reasons described in Johnston's "Primary Key Re-Engineering Project" articles, we would
prefer additional semantics for oids and vids, namely:

No two objects, of whatever type (policies and clients, here) may have the same oid.
No two versions, of the same or of different objects, may have the same vid.

One final point before we conclude Part 14. Surrogate keys are usually single-column primary
keys. But our versioned tables are two-column primary keys. It isn't necessary to use two columns
– an oid and a vid – as the primary key of versioned tables. But it is very useful to have a system-
generated key for an object, with an additional distinct system-generated key for a version of an
object. What isn't necessary is that these two system-generated columns be used as the primary
key of a versioned table. Instead, we could have a typical surrogate key, i.e., one consisting of a
single column.

Why do we need a system-generated oid and also a system-generated vid? It is not for the sake of
semantics. As far as semantics is concerned, we could use a single-column surrogate key for
versioned tables, while retaining the client-nbr (or policy-nbr) and associated version begin date.
No oid, no vid.

Recall that client-nbr and policy-nbr were our somewhat simplified examples of business keys. So
it could just as easily have been that policies had a four-column business key while the client had
a three-column business key. Quite possible no two or more of those six columns would have the
same data type and length.

5 of 6 10/7/2008 1:06 PM
Time and Time Again: Managing Time in Relational Databases, Part 14 - ... http://www.dmreview.com/dmdirect/2007_47/10000330-1.html?type=pri...

Why is that significant? It is significant because, all along, we have had a single, enterprise-wide
solution to versioning as our objective. Not one solution for one database and a slightly different
solution for another database. While we may be able to retrieve data from versioned tables with
SQL alone, it is not possible, as we will see, to update data in versioned tables with SQL alone.
For updating versioned tables, therefore, there is a codebase. Our enterprise objective, if we are to
take it seriously, requires that one and the same codebase be deployed for all versioning
solutions.

If an enterprise versioning solution is used, let us say, in six databases across the enterprise, then
there may be six physical copies of that codebase. An enterprise solution does not require
federated access to a single physical codebase instance. But it does require that all instances of
that codebase be line-for-line, byte-for-byte identical. The only alternative is to develop a custom
versioning solution for each database that requires versioned tables. While that may be the most
common way of implementing versioning in an enterprise, it is not our objective.

The Versioned Table Surrogate Key Position, Going Forward

The concept of surrogate keys is a contentious one. Publications arguing both for and against
their use go back at least twenty years. Arguments based on both theoretical and practical
concerns can be made for using surrogate keys and also for not using surrogate keys.

One of us (Johnston) has been an advocate of using surrogate keys for all kinds of tables, in all
kinds of databases. This means for dependent and associative tables, not just for the “kernel”
tables which those tables are dependent on, or relate. This means for tables in OLTP and data
mart databases, not just in ODS and data warehouse databases. We will refer to this as the
“strong surrogate key” position to distinguish it from positions which recommend surrogate keys for
some kinds of databases but not for others, or for some kinds of tables but not for others.

In Part 15, we will provide an outline of the major arguments for and against the strong surrogate
key position. This will not be an impartial review of arguments on both sides. It will be a distillation
of the arguments Johnston has made in half a dozen articles going back to 2000, and also a
preview of the presentation he will be making at the DAMA International Conference next March in
San Diego.

For more information on related topics, visit the following channels:

Data Management
Data Management

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

6 of 6 10/7/2008 1:06 PM
Time and Time Again: Managing Time in Relational Databases, Part 15 - ... http://www.dmreview.com/dmdirect/2007_48/10000383-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 15 - For and
Against Surrogate Keys
Tom Johnston, Randall Weis
DM Direct, December, 2007

In Part 14, we introduced the concept of surrogate keys and showed how our example database of
clients and policies would look if we used surrogate keys with that database. We also introduced
some terminology that will help us examine the arguments for and against their use, terms such
as "business key," "natural key" and "mixed key." We then went on to define two sets of semantic
constraints that give surrogate keys their meaning. The minimal set consisted of three rules for the
Client table and a corresponding three rules for the Policy table. The full set consisted of an
additional two rules, both of which applied to both tables.

We also designated the approach to surrogate keys that we will use in the remainder of these
articles as the "strong approach." The approach we will contrast it to we will designate the
"standard approach," for such it is. We begin, then, by describing the standard approach and our
reasons for dismissing it.

Intuitive Data Reporting & Dashboards


Get instant visibility into your business with clickable access to your
data no matter where it resides. Customize reports & dashboards on
the fly.Try PivotLink now.

Neither the chart of version patterns nor the chart of this series are included in this article. The
reason is that this discussion of surrogate keys is relatively self-contained and makes few
references to earlier articles in this series.

Surrogate and Other Kinds of Keys: Definitions

Primary Key. The column or columns of a table in a relational database that the database
management system (DBMS) recognizes as the unique identifier column or columns for that
table.
Foreign Key. One or more columns of a table in a relational database used by the DBMS to
establish a relationship between that table and another (not necessarily distinct) table. At the
instance level, foreign keys relate pairs of rows such that each foreign key value is identical
to a primary key value in the referenced table.
Surrogate Key. A primary key whose values are system-generated, and which therefore
carry no business meaning. Surrogate keys are usually single-column keys.
Mixed Key. A multi-column primary key, some of whose columns contain system-generated
values and other of whose columns contain business-meaningful values.
Business Key. A unique identifier for a table, all of whose values carry business meaning.
Natural Key. A primary key that is a business key.

1 of 6 10/7/2008 1:06 PM
Time and Time Again: Managing Time in Relational Databases, Part 15 - ... http://www.dmreview.com/dmdirect/2007_48/10000383-1.html?type=pri...

Surrogate Keys: The Standard Approach

When IT practitioners speak of a database that uses surrogate keys, they are usually referring to a
database some of whose tables have single-column surrogate primary keys, but other of whose
tables have multi-column primary keys whose values may or may not be system assigned. Thus,
they might use a single-column surrogate key for a Purchase Order table but create a mixed
primary key for a Purchase Order Line Item table, one consisting of a foreign key to the Purchase
Order table together with a line number. For an associative table matching up purchase orders to
receipts, they would typically follow the general practice for associative tables and create a primary
key by concatenating the primary keys of the Purchase Order and Receipt tables.

Let's call this the standard approach to using surrogate keys. What might justify the standard
approach as the best way to assign primary keys to the tables in a database? It would have to be
something like this. Whether or not surrogate keys are preferable to business keys is something
that should be determined for each table. Now consider a database all of whose tables have
natural keys. Among those tables, there is a subset whose primary keys do not contain any
foreign keys, i.e., which do not use "identifying relationships." Call them "kernel tables." Assign a
single-column surrogate key to these kernel tables.

That surrogate key will then become part of an identifying relationship in all of the other tables to
which each kernel key table is related. There are two types of such other tables: dependent tables
and associative tables. Dependent tables have primary keys which contain one surrogate foreign
key. Associative tables have primary keys that contain two surrogate foreign keys.

Our objection to this standard approach is that it makes the decision to use or not to use surrogate
keys at the level of individual tables. The result is a database in which some tables use surrogate
keys, other tables use mixed keys and yet other tables may use natural keys.

We have seen various rules of thumb offered for deciding whether to use a surrogate, mixed or
natural key for a particular table. For example, some database modelers/designers will always use
a natural key for reference/code tables. Their argument is that besides the value being referenced,
all that such tables contain are things like a short and a long description. Since for most purposes,
it's just the code that is wanted in a result set, using that code as the natural key of the reference
table means that a join to that table is not required in order to get the code value. ICD-9 codes on
a medical claim are a good example. If the ICD-9 code itself is the foreign key, the claim with its list
of ICD-9 codes can be materialized without needing to join back to the ICD-9 reference table.

We prefer to make the decision to use or not use surrogate keys at the database level. For all the
tables in the database, it's all or nothing. Either all of them use surrogate keys, or none of them
do. If single-column surrogate keys are being used, then the primary keys of all tables contain
exactly one column. This is true of associative and dependent tables, of core tables and code
tables. All use surrogate keys.

Moreover, on our strong approach to surrogate keys, no surrogate key value will appear in more
than one primary key instance anywhere in the database. A primary key is unique, in other words,
across all tables in the database, not merely unique within one table.

Argument Against the Standard Approach to Surrogate Keys

Our argument against the standard approach to using surrogate keys is simplicity itself. The
standard approach leaves us with a mixture of primary and foreign keys to manage that share no

2 of 6 10/7/2008 1:06 PM
Time and Time Again: Managing Time in Relational Databases, Part 15 - ... http://www.dmreview.com/dmdirect/2007_48/10000383-1.html?type=pri...

consistent syntax and across which duplicate values are possible.

The argument for using surrogate keys rather than natural keys is simple: natural keys can
change, and that change can be expensive. As for change, the values in natural keys describe
something about what the row they are part of represents; that follows from the definition of
"natural key." Suppose one natural key is a two-byte column whose values represent geographical
sales regions and are: NE, NW, SE and SW. Now suppose the VP of Sales, after extensive
analysis and reflection, of course, decides to change sales regions as follows:

California is too big to be lumped in with the rest of the SW, so it will become its own sales
region, with "CA" as its code.
The NE and SE regions are also getting too big, but we don't want to carve out one or more
states to fix their problem. Instead, we will split them each into two regions, with designations
as follows: NE1, NE2, SE1 and SE2.

We invite you to reflect on the cost of such changes. Next, think about the cost of these changes if
IT had, instead of using natural keys, used a surrogate key for this Sales Region table.

Argument 1 Against Surrogate Keys: Users Don't Understand Them

Taken in one sense, this argument means that users don't know what surrogate key values stand
for. But of course, that's the whole point. The system generates values for surrogate keys, and
those values have no business meaning. No wonder users don't understand them.

Taken in another sense, this argument means that users don't understand the need for surrogate
keys, that they are uncomfortable when they see surrogate keys in their result sets and so on.

The need for surrogate keys can be explained to business users in two steps. First of all, primary
keys are needed to uniquely designate each row in each table. Secondly, the business may need
to occasionally require IT to changes values and/or data types and lengths of primary keys in
order to make the data being used as a unique identifier more accurately describe what it is about.
And when those changes are required, the costs of implementing them can be enormous. We saw
this above, in the argument against the standard approach to surrogate keys.

Argument 2 Against Surrogate Keys: Users Don't Use Them

This argument is that, because surrogate keys do not contain business data, users will never
select rows from tables by supplying surrogate key values. This is certainly true, but it doesn't
support the conclusion that users don't use them. On the contrary, surrogate keys, together with
the foreign keys that replicate their values, are used to join rows to other rows.

If a user wants information about a particular Policy, for example, we may assume that he knows
what the business key of that policy is. His query then begins with a Select operation on the Policy
table, with policy-nbr as the target, and a specific policy number as the search argument. Since
business keys should be unique, this process should select exactly one row (provided the
supplied policy number is valid).

Next, logically although not temporally speaking, the query proceeds to the Join operation in
which one or more other rows foreign key-related to the original row, are also selected.

Finally, again speaking logically and not temporally, the query proceeds to the Project operation in

3 of 6 10/7/2008 1:06 PM
Time and Time Again: Managing Time in Relational Databases, Part 15 - ... http://www.dmreview.com/dmdirect/2007_48/10000383-1.html?type=pri...

which some columns are designated to be included in the result set, while the other ones drop
out.

The entire query, at this point, asks a question of the database. It says, "Do there exist one or
more instances of data that satisfy this specification?" If so, the resulting instances are returned,
each one as a row in a result set.

There is generally no good reason to include surrogate keys or their associated foreign keys in the
result set. There is generally no good reason for users to ever see a surrogate key value. For even
if the user writes his own SQL and uses surrogate keys and foreign keys in his joins, he does not
supply a surrogate key value anywhere in the query, and isn't interested in a surrogate key result.
Primary and surrogate key join pairs, like two adjacent links in a chain, are purely DBMS
machinery for linking data together.

Argument 3 Against Surrogate Keys: They Take Up Extra Space

This argument points out that if we used a table's business key as its primary key, we wouldn't
need an extra surrogate key column. Say that surrogate key is eight bytes long, and it occurs in a
million-row Client table. That's eight million unnecessary bytes, approximately 7.5MB extra for the
entire table.

While direct access storage device (DASD) space is relatively cheap, the argument will continue,
response time to a user's query is not. And the eight extra bytes on each row also contributes to
poorer response time, because fewer rows will fit into a single I/O block; fewer rows at a time can
be transferred between secondary storage and main memory.

In response, we need to distinguish between the impact of a surrogate key on the table it is the
primary key for and its impact across the entire database. In its native table, the Client table in this
case, adding a surrogate key will indeed add extra space to the storage requirements for the table
and will also add some amount of extra time when doing sequential retrievals, i.e., retrievals that
require moving I/O block after I/O block from DASD to main storage.

But let's consider all the tables in the database and nonsequential retrievals. Across all the tables
in a database, the use of surrogate keys may well reduce the total requirement for storage space.
For consider our eight-byte, single-column surrogate key, and compare it to a natural key of one
or more columns, which may well be longer than eight bytes.

Next, suppose that the Client table is an important table in the specific sense that many other
tables have foreign keys to the Client table, and for some of them those foreign keys are part of
their own primary keys, i.e., those foreign keys implement identifying relationships. Finally,
suppose that for some of these other tables with identifying relationships to the Client table, there
are further tables that have foreign keys to them.

Throughout this chain of direct and also indirect foreign keys to the Client table, and depending,
of course, on the number of other rows which instantiate these foreign keys, we may have dozens
or even hundreds of occurrences of Client table primary key values.

Let us assume that, at some point in time, there are 5,000 such Client foreign keys. And let us
assume, further, that the business key for the Client table would have been two columns, totaling
twelve bytes. The arithmetic here is simple. In the first case, the foreign keys total 60kb. In the
latter case, they total 40kb.

4 of 6 10/7/2008 1:06 PM
Time and Time Again: Managing Time in Relational Databases, Part 15 - ... http://www.dmreview.com/dmdirect/2007_48/10000383-1.html?type=pri...

Now consider a database in which important tables like our Client table contain millions of rows.
And consider then the trade-off in space consumed between such a database in which surrogate
keys were used for all tables versus a database in which business keys were used for all tables.

There is indeed a significant difference in storage requirements between using and not using
surrogate keys. Surrogate keys are much more space-efficient.

Now let's briefly consider the argument that retrievals will take longer with surrogate keys because
fewer rows will fit into an I/O block. Although there are several things to note, we will take the time
to mention only one of them. This argument applies only when sequential retrievals are being
considered. When a business key value is in hand, and about to be used to select a client, then
after a lookup in the unique index defined on business key, only a single I/O block will be
retrieved. As for joins, once a primary key value is in hand, and about to be used in a join, then
after the lookup in the index, only a single I/O block will be retrieved. Any negative impact on
retrieval time caused by using surrogate keys is negligible to the point of being undetectable.

Argument 4 Against Surrogate Keys: They Require Extra Joins

This argument points out that whenever a surrogate primary key is used, then all relationships to a
row with such a key use foreign keys whose values are identical to that row's surrogate key value.
So whenever a query has specified a row which has such a foreign key, and whenever that query
wants to include all or part of the business key of the related row, a join is required to bring that
business key into the result set.

This argument is correct. However, it is possible to avoid joins that the use of surrogate keys would
otherwise make necessary. The solution is to denormalize and put duplicate columns on the
referencing table, columns that duplicate the business key of the row that the surrogate foreign
key is pointing to.

The Argument For Surrogate Keys

One of us (Johnston) will present a more complete case for using surrogate keys on all tables in a
database, and will do so at the DAMA 2008 Conference in San Diego in March 2008. But in brief,
the reasons we will proceed to discuss versioning in terms of sample tables that use surrogate
keys are these:

1. We want an enterprise solution to versioning, one applicable to any table and any database
without modifying the code that implements it. That requires that all versioned tables have
the same syntax for their primary keys, i.e., the same columns each with the same data
types, lengths and domains.
2. We want to avoid the usually very high costs of changing primary keys by adopting keys that
there will never be a business reason to change. Such keys are precisely those that carry no
business meaning and whose only semantics are to link related rows together.

A Physical Description of Earlier Version Patterns

In Part 15, we begin to revisit our earlier version patterns and provide a physical description of
them. This includes physical schemas and SQL, and shows how these version patterns work in a
real database. For a longer-term look at where this series is going, consult Part 13.

5 of 6 10/7/2008 1:06 PM
Time and Time Again: Managing Time in Relational Databases, Part 15 - ... http://www.dmreview.com/dmdirect/2007_48/10000383-1.html?type=pri...

For more information on related topics, visit the following channels:

Data Management
Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

6 of 6 10/7/2008 1:06 PM
Time and Time Again: Managing Time in Relational Databases Part 16 - ... http://www.dmreview.com/dmdirect/2007_51/10000487-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases Part 16 - Physical
Implementation of Version Pattern 1
Tom Johnston, Randall Weis
DM Direct, December 21, 2007

Starting with this installment in our series, we will begin to describe the physical implementation of
these versioning patterns. To this point, patterns have been presented from a logical perspective.
Data modelers would recognize these discussions as being about logical data models, rather than
physical data models. Almost equivalently, we could say that previous articles in this series have
discussed the semantics of versioning rather than its implementation.

We now turn to the corresponding physical data models and to their implementation in a real
database. Throughout the rest of this series, we will refer to articles that focus on implementation
as "PDM (physical data model) segments" of this series, where each segment is a separate article.
We will reference each pattern by version number, and we will use the illustrations and the
scenarios of the version pattern within the context of each of these PDM segments.

Integrating Data: An Open Source Approach


Any technology deployed in the information system needs to
interoperate with existing applications or databases. Learn through
real-life scenarios how Open Source data integration solves the
interoperability challenge. Free White Paper.

These PDM segments will show the physical Data Definition Language (DDL) used to create the
tables in the schema; the Structured Query Language (SQL) used to populate and update those
tables; and the SQL used to query the tables. We will also discuss performance and data integrity
in the physical database. In addition, we will describe and discuss known issues, database
management system (DBMS) limitations and alternatives to the implementations that are depicted
here.

We encourage you to create, populate and query these objects in your own database. By doing
so, you will deepen your appreciation of the temporal functionality provided by the early version
patterns, and of the more complete functionality provided by the later patterns. In addition, you
may learn to develop some of your own variations on these patterns, variations that might better
apply in your own DBMS and application environment.

Preliminaries

We will be using Microsoft SQL Server as the DBMS. However, with some minor adjustments, the
SQL and the designs suggested in these PDM segments could be used in other DBMSs, such as
IBM’s DB2, Oracle or MySQL. Various database objects will be created, including tables, views,
indexes, primary keys, foreign keys, triggers and stored procedures. However, we will assume that
the reader will create other needed objects, such as databases, containers, file groups and
tablespaces, and will have the appropriate permissions to create and use these objects.

1 of 4 10/7/2008 1:09 PM
Time and Time Again: Managing Time in Relational Databases Part 16 - ... http://www.dmreview.com/dmdirect/2007_51/10000487-1.html?type=pri...

In SQL Server the schema owner is often defined as “dbo.” We will not prefix the objects with
“dbo,” but that will be the implicit schema name. However, in many DBMSs the default schema
name is the same as the current SQL ID or login ID of the person that is connected to the
database. This should suffice for most of these examples.

Some DBMSs will implicitly create certain database objects based on the creation of other objects.
For example, some DBMSs will automatically create indexes based on the creation of primary keys
and foreign keys, but others will not. In many of our examples, we will include the DDL to create
objects that might be implicitly created by other DBMSs. Thus, you many encounter some minor
syntax or duplicate object errors when declaring these objects to your DBMS. In many cases, you
will be able to ignore these errors. We suggest that you confirm that the error does not
compromise the model, and that the fundamental purpose of the object, function or statement is
fulfilled in some other way.

The SQL we present will often use literals for the variables rather than parameter markers (host
variables). This is for illustrative purposes only. Normally, for systems that require high
performance, applications should use either statically bound SQL or prepared statements that use
parameter markers in dynamic SQL such as JDBC rather than dynamic literals. This strategy
saves a significant amount of overhead, avoiding constant reoptimization of the SQL’s access
path. So, while we will use literals in our examples, this is not the recommended approach and is
only used to clarify the examples.

Note also that some companies have standards to not use a “select” clause. We will use it in our
examples simply to save textual space.

The Application

The insurance industry has extensive date/time requirements for their application databases.
These include:

The date range an insurance policy is effective,


The date an event occurred,
The date a service was performed,
The date a claim was adjudicated, and
The date a claim was paid.

Often the dates when a policy or a policy-related event are recorded in a database do not match
the effectivity range of the policy or the date on which the related event actually occurred. This
distinction is one of the reasons we chose to use an insurance application in our examples.
However, many other industries have complex date-driven rules and business processes. In fact,
most large companies need to adhere to best practices or to various regulations, such as
Sarbanes-Oxley, that require appropriate accounting and tracking of data changes. Therefore, we
believe that while your application may not appear to currently need advanced version patterns, it
is likely that, in time, they will.

So, as you read these physical model segments and try out the examples, you should be able to
easily translate these PDM entities, attributes, objects and examples into your own industry’s
terms.

Keys

2 of 4 10/7/2008 1:09 PM
Time and Time Again: Managing Time in Relational Databases Part 16 - ... http://www.dmreview.com/dmdirect/2007_51/10000487-1.html?type=pri...

In non-versioned data models, and also in some of the earlier patterns in this series, the primary
key of the table is also considered the unique identifier of the object. For example, either a
business key or a surrogate key of the Policy object might be called policy_nbr, and it would be
the primary key in the physical database. However, as we progress through the more advanced
versioning patterns, we will discover that the object identifier alone will not uniquely identify a
single occurrence (row) in a table because multiple versions of an object may (and usually will)
exist in a versioned table. This restriction in current relational databases – that the distinction
between an object and its versions must be made by the developer and is not built-in to either
standard or vendor-augmented DDL – forces us to create a new surrogate key that represents the
composite of an object id and either a version date/time or a version id, which now becomes the
unique primary key.

Therefore, as you review these patterns, consider the fact that the object id will not always be
considered the primary key, regardless of whether it is a business key or a surrogate. So to help
you distinguish the different columns that would normally represent a single object and the
primary key in non-versioned tables, these columns (keys) will be suffixed with “OID,” meaning the
object id, and surrogate columns that distinguish versions of an object will be suffixed with “VID,”
meaning version id.

PDM Structure

These PDM segment articles will be structured as follows:

Reference to the version pattern as defined in the earlier articles.


A brief description of the pattern.
A table facsimile with sample rows.
The DDL to create the base table(s).
Supporting DDL, such as constraint, index or view creation.
Special design notes and considerations.
A copy of the scenario figure used in the original version pattern.
The Data Manipulation Language (DML) to update the database for each event.
SQL to query the results at various points in time.
The results of the queries in tabular format.

Figures 1 and 2: Templates that Will be Used (All figures are oversized and linked at the end of
the article) (See pdf).

Setup

The initial patterns will concentrate on the Policy table. We will retain pattern versions in separate
tables so we can compare results. Therefore, each Policy table will be prefixed with a “P#” to
identify each pattern number. We will also, as needed, reference a Client table. The following DDL
and SQL can be executed to create the base Client table.

Figure 3: Client Table Setup (see pdf)

Version Pattern 1

Now let’s see how Version Pattern #1 will look in a physical database.

3 of 4 10/7/2008 1:09 PM
Time and Time Again: Managing Time in Relational Databases Part 16 - ... http://www.dmreview.com/dmdirect/2007_51/10000487-1.html?type=pri...

Figure 4: Version Pattern 1 (see pdf)

Version 1 Scenario 1.1

Figure 5: Scenario 1.1 (see pdf)

Version 1 Scenario 1.2

Figure 6: Scenario 1.2 (see pdf)

Version 1 Scenario 1.3

Figure 7: Scenario 1.3 (see pdf)

These scenarios show the physical data model for the first version pattern, which has no history or
future dating requirements whatsoever, as well as insert, update, delete and retrieve SQL against
that physical model. This is the easiest version pattern to update and query, but by the same
token is the least expressive of temporal requirements. The date a row was created, and the date it
was last updated, provide the only persisted temporal information about the data.

Relational databases and SQL standards have not yet evolved to easily support various temporal
states. So, this series will continue to show you how to implement feature-rich version patterns into
your database that could also become part of your enterprise architecture. These physical design
segments will take this concept from theory to actual implementation. It’s about time!

For more information on related topics, visit the following channels:

Data Management
Databases
DW Basics

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

4 of 4 10/7/2008 1:09 PM
Object/Version PDM Pattern Template

Version: Pattern Number and Name

Description: Brief Description of Pattern

Table Facsimile
my_oid (pk) parent_oid (fk) attr_n_val crt_dt last_updt_dt
138 882 PPO 1/12/2004 1/12/2004

Table Creation DDL


Create table my_table (
my_oid int not null,
parent_oid int null,
attr_n_val char(10) not null,
crt_dt datetime default getdate() not null,
last_updt_dt datetime default getdate() not null) ;

Primary Key Index DDL


Alter table my_table Create unique index
add constraint pk_my_table pk_my_table
primary key (my_oid); on my_table
(my_oid);
Foreign Keys
Create clustered index
Alter table my_table
fk_ parent_child
add constraint fk_parent_child
on my_table
foreign key (parent_oid)
(parent_oid);
references parent (parent_oid)
on delete restrict;

Notes: Special design notes and options

figure 1
Scenario Template

Scenario: Version Pattern Number Scenario Number and Name

Description: Brief Description of Scenario

Scenario Figure

Insert SQL
Insert into my_table
(my_oid, parent_oid, attr_val, crt_dt)
VALUES(138, 882, 'PPO', '2004-01-12');
Query SQL
Select * from my_table
where my_oid =138;

Notes: Explanation of how the SQL works and any special notes regarding SQL and
criteria. Hints, tips options.

Query Results
my_oid parent_oid attr_n_val crt_dt last_updt_dt
138 882 PPO 1/12/2004 1/12/2004

figure 2
Client Table Setup

Create table client (


client_nbr int not null,
client_name varchar(50) not null,
crt_dt datetime default getdate() not null,
last_updt_dt datetime default getdate() not null) ;

Alter table client


add constraint pk_client
primary key (client_nbr);

Insert into client (client_nbr, client_name, crt_dt)


values (882, ‘Mary Client', '2004-01-19') ;

figure 3
Object/Version PDM Pattern

Version Pattern 1: Updating in Place

Description: No versioning with updates occurring in place, and the end of an


object’s episode is represented by the deletion of the object.

Table Facsimile
policy_nbr client_nbr policy_typ crt_dt last_updt_dt
(pk) (fk)
138 882 PPO 2004-01-12 2004-01-12

Table Creation DDL


Create table p1_policy (
policy_nbr int not null,
client_nbr int null,
policy_typ char(10) not null,
crt_dt datetime default getdate() not null,
last_updt_dt datetime default getdate() not null) ;

Primary Key Index DDL


Alter table p1_policy Create unique nonclustered
add constraint pk_p1_policy index ix_pk_p1_policy
primary key (policy_nbr); on p1_policy
(policy_nbr);
Foreign Keys
Create clustered
Alter table p1_policy
index ix_fk_p1_policy_client
add constraint pk_p1_policy_client
on p1_policy
foreign key (client_nbr)
(client_nbr);
references client (client_nbr)
on delete no action
on update no action ;

Notes: We clustered this table by the foreign key because access is normally by a
foreign key. The delete/update rule is set to “no action” which is similar to
“restrict”; however, cascade could be used, also. Remember, however, there is no
versioning.

figure 4
Scenario

Scenario 1.1 of Version Pattern 1: an Insert to the Policy Table

Description: shows the result of entering Mary's policy into the Policy table

Scenario Figure

Insert SQL
Insert into p1_policy
(policy_nbr, client_nbr, policy_typ, crt_dt, last_updt_dt)
VALUES(138, 882, 'PPO', '2004-01-12', '2004-01-12');
Query SQL
Select * from p1_policy
where policy_nbr=138;

Notes: Add and look up Mary’s Policy.

Query Results
policy_nbr client_nbr policy_typ crt_dt last_updt_dt
138 882 PPO 2004-01-12 2004-01-12

figure 5
Scenario

Scenario 1.2 of Version Pattern 1: an Update on the Policy Table

Description: On 3/14/05, Mary changes her policy type from PPO to HMO (Health
Maintenance Organization). After that change is recorded in the Policy table, Mary's
Policy row looks like this:

Scenario Figure

Update SQL
Update p1_policy
Set policy_typ='HMO', last_updt_dt='2005-03-14'
where policy_nbr=138;
Query SQL
Select * from p1_policy
where policy_nbr=138;

Notes: Update Mary’s Policy in place. We’ve lost all record that Mary originally
had PPO

Query Results
policy_nbr client_nbr policy_typ crt_dt last_updt_dt
138 882 HMO 2004-01-12 2005-03-14

figure 6
Scenario

Scenario 1.3 of Version Pattern 1: a Physical Delete to the Policy Table

Description: On 6/1/06, Mary's policy is ended (by contract expiration, explicit


termination, revocation, etc. - it doesn't matter). This is recorded, using Version
Pattern 1, by physically deleting her policy's row in the Policy table.

Scenario Figure

Delete SQL
Delete from p1_policy
where policy_nbr=138;

Query SQL
Select * from p1_policy
where policy_nbr=138;

Notes: Mary’s Policy is deleted. We’ve lost all record that Mary originally had
PPO, HMO, or had any policy whatsoever

Query Results
policy_nbr client_nbr policy_typ crt_dt last_updt_dt
[no row]

figure 7
Time and Time Again: Managing Time in Relational Databases, Part 17 - ... http://www.dmreview.com/dmdirect/2008_53/10000583-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 17 - Physical
Implementation of Version Pattern 2
Tom Johnston, Randall Weis
DM Direct, January 4, 2008

In this PDM segment, we will describe the physical implementation of Version Pattern 2. This
pattern was described, from a logical point of view, in part 3 of this series. This pattern expands on
pattern 1 by adding a logical delete capability.

This pattern is more semantically expressive than Version Pattern 1, just as each pattern
described in this series is more expressive than the patterns prior to it. This pattern allows us to
keep a record of deleted rows, and thus information about when they were first inserted, last
updated and, finally, deleted. A delete request which is carried out as a physical delete loses all
this information, removing all evidence, in the online table, that the row ever existed.

Join the Largest Community of MDM Experts in New York City,


October 19-21, 2008
Restructured to better suit your business needs, MDM Summit Fall 2008
offers value of experience and thought leadership. Speakers will discuss
hierarchy management, identity resolution and more. Pre-register by
August 8 for bonus savings!

To support logical deletes, we need an additional column, either some kind of a delete indicator or
else a delete date. A delete indicator would be set to "N" when the row is first inserted, and set to
"Y" if and when the row is logically deleted. A delete date would be set to {null} when the row is
first inserted, and set to the date of the logical deletion if and when the row is logically deleted.

Remember, in this series when we reference a column as “date” we normally would consider a
“date/timestamp” to be more accurate and to provide more flexibility as to when we can change the
state of a row. We use dates in our examples to help keep them simple.

We prefer to use a delete date because it provides more information than a delete indicator. If a
delete indicator is used, we must decide to either leave the last update date alone, or to put the
delete date in that column. If we choose the first option, we have lost the delete date. If we choose
the second option, we have overwritten the date of the last update. But if we use a delete date
instead of a delete indicator, we can preserve both dates. In addition, as we will show later, it
becomes easier and more efficient to query when looking for rows that were active at a given point
in the past. Also discussed later in this article is the option of using a special date value rather
than {null}.

A variant of Version Pattern 2 would be to add a “delete reason” column. Together with a delete
date, this would tell us both when and why rows were logically deleted. For example, a row in a
Customer table might have been logically deleted because of no recent orders, bad credit, etc.
And, of course, the combination of insert date and delete date tells us how long the row was
logically present in the table. (This may not correspond to how long the customer was active,

1 of 4 10/7/2008 1:11 PM
Time and Time Again: Managing Time in Relational Databases, Part 17 - ... http://www.dmreview.com/dmdirect/2008_53/10000583-1.html?type=pri...

however. It will correspond only if the business chooses to treat row insertion and deletion dates
as also representing customer effectivity dates.)

With rows logically rather than physically deleted, it is also possible to maintain referential integrity
for rows in dependent tables. Otherwise, a physical deletion would force us either to set foreign
keys to {null}, or to cascade delete those dependent rows. In certain types of queries, data from a
logically deleted parent row might frequently be usefully joined to nondeleted child rows. This
would provide data about the parent row related to those child rows, a relationship that was in
effect between the later of the insert dates for the parent and the child rows, and the delete date
of the parent row.

Foreign Key Constraints

Most popular database management systems (DBMSs) allow a “Set Null” delete option. With this
option, if there are one or more rows in one or more dependent tables that have a foreign key to a
row being deleted, then those foreign keys are set to {null}. As just mentioned, while this approach
retains the dependent rows, it loses the link to the parent row. So a logical delete allows us to
retain more information than a physical delete which uses the Set Null option.

These same DBMSs also provide for a “Cascade” delete option. With this option, if there are one
or more rows in one or more dependent tables that have a foreign key to a row being deleted, then
those dependent rows are also deleted. If other rows have foreign keys to these dependent rows,
then those other rows will also be deleted. Usually, Cascade delete options are avoided in most
financial and other mission-critical applications. A Cascade delete retains less information than a
Set Null delete against the same set of rows because a Cascade delete removes the dependent
rows while a Set Null delete retains the rows and merely sets their foreign keys to {null}. So a
logical delete also allows us to retain more information than a physical delete with the Cascade
option.

The third common delete option is called “No Action” or “Restrict.” No Action and Restrict are
nearly identical, and in DBMSs that support both, the only difference is when the constraint is
enforced – before or after other constraints. This constraint is used as a safety valve to prevent the
inadvertent deletion of dependents when a parent row is physically deleted and the inadvertent
deletion of a parent row that has dependents. Therefore, in this physical implementation of
Version Pattern 2, we will use the “Delete No Action” option because no rows should ever be
physically deleted, but if someone tries to physically delete a row, this option will prevent the
deletion if that row has dependent rows.

This will not, however, prevent the physical deletion of rows that do not have dependents. So, if
we really want to prevent the physical deletion of rows in a table, the safest approach would be to
revoke the delete permission on the table from the users, or not grant delete authority in the first
place. This will force the user to update the logical delete indicator (or logical delete date) in order
to perform a business deletion.

A logical delete date or indicator column can help maintain referential integrity if we want to
inactivate a parent table row while retaining the dependents’ relationship. For example, we may
want to inactivate a customer but retain both the orders they placed and the foreign key
relationship. This is helpful for several reasons, including the ability to find the customer name and
address of an order that was shipped last year, even though the customer is no longer active.

This approach also gives us the ability to reactivate the customer if conditions change – without

2 of 4 10/7/2008 1:11 PM
Time and Time Again: Managing Time in Relational Databases, Part 17 - ... http://www.dmreview.com/dmdirect/2008_53/10000583-1.html?type=pri...

having to create a new customer entity occurrence – thus retaining the customer's previous
history. But if we do this, we will lose the information that this customer was logically deleted and
inactive for a period of time before its re-activation. To handle re-activation without loss of
information required a more advanced version pattern.

Figure 1 shows how Version Pattern 2 will look in a physical database.

Figure 1: Version Pattern 2 (see PDF below)

Figure 2: Version 2 Setup (see PDF below)

Starting in this PDM segment we will include a setup section that will show the preparation of the
data and chronology of SQL events leading up to the Scenarios.

Figure 3: Version 2 Scenario 2.1 (see PDF below)

Figure 4: Version 2 Follow-Up (see PDF below)

We will also perform some follow-up SQL that shows various ways the data can be queried.

Two Approaches to a Delete Date

In the physical implementation of our remaining versioning patterns, we will not use {null} in delete
dates. Instead, we will use the special value “12/31/9999.” Two reasons to use this special value
are to simplify queries and to improve performance.

For example, if a delete date is inclusive and nullable, the SQL that looks for active rows at a given
point in time may be written as:

“… AND (crt_dt <= ‘01/01/2005’ AND (del_dt >= ‘01/01/2005’ OR del_dt is null)) ”
…using the table facsimile in Figure 5.

Figure 5: Version 2 Follow-Up (see PDF below)

This query will return policies 138 and 139 because they were both active on 01/01/2005.

However, if we used “12/31/9999” for the delete date, and it is inclusive, the SQL may be simply
written as:

“… AND (:my-date between crt_dt AND del_dt ) ”

or another way:
“… AND (crt_dt <= :my-date AND del_dt >= :my-date) ”
where :my-date is ‘01/01/2005’ using the table facsimile in Figure 6.

Figure 6: (see PDF below)

This query will also return policies 138 and 139 because they were both active on 01/01/2005.

Using “12/31/9999” instead of {null}, the SQL is clearly simpler to write. In addition, performance

3 of 4 10/7/2008 1:11 PM
Time and Time Again: Managing Time in Relational Databases, Part 17 - ... http://www.dmreview.com/dmdirect/2008_53/10000583-1.html?type=pri...

will be better because we have removed the “OR” from the optimization criteria. Many DBMSs do
not optimize ORs very well, and often will alter the access path to something less efficient
compared to similar SQL without the OR. Using “12/31/9999,” we may also decide to put an index
on the delete date, usually combined with some other columns such as client_nbr.

Wrap-Up

This article has shown how to implement Version Pattern 2, which builds on Version Pattern 1 by
providing a logical delete to use in place of a physical delete. However, updates using this version
pattern still lose a great deal of information. This pattern does not retain previous versions of an
object, nor can it distinguish between re-activated objects and those which were never deleted.
Also, we’ve only discussed one dimension of time, the system dates of the row's creation and
deletion. Thus, Version Pattern 2 does not manage bi-temporality. Its dates are dates of database
activity. Business meaningful dates – specifically the dates on which a version of an object
became effective and ceased being in effect – are not explicitly represented in Version Pattern 2.

In the next several articles, we will proceed to demonstrate how to physically model the more
advanced semantics of the higher version patterns. We will also present the SQL needed to
maintain and query these tables when these more advanced patterns are used.

Time is on our side!

For more information on related topics, visit the following channels:

Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

4 of 4 10/7/2008 1:11 PM
Time and Time Again: Managing Time in Relational Databases, Part 18 - ... http://www.dmreview.com/dmdirect/2008_55/10000644-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 18 - Physical
Implementation of Version Pattern 3
Tom Johnston, Randall Weis
DM Direct, January 18, 2008

In this article, we will describe the physical implementation of Version Pattern 3. This pattern
expands on Patterns 1 and 2 by retaining a history of updates. This pattern was logically
described in part 4 and part 5 of this series.

Pattern 2 provided the ability to query objects that were previous deleted. However, it failed to save
history created by updates. In Pattern 3, we will start saving this update history by using
versioning, and will lay the groundwork for the more robust history patterns that follow.

Join the Largest Community of MDM Experts in New York City,


October 19-21, 2008
Restructured to better suit your business needs, MDM Summit Fall 2008
offers value of experience and thought leadership. Speakers will discuss
hierarchy management, identity resolution and more. Pre-register by
August 8 for bonus savings!

Version Tables

One way to keep track of updates is to use history tables that are nearly clones of the object table,
and are populated by using update triggers. There are pros and cons with using this approach,
which we will discuss later. Pattern 3 takes an approach where all of the versions (current and
past) are stored in the same table. A Policy was the object in the Policy table in Patterns 1 and 2.
Now, the table will contain multiple versions of the same object. Therefore, to distinguish these,
the table that contains the versions of a policy will be called the “Policy Version” table.

While you might assume that having a policy version table would also require a policy (object)
table, this is not so. We will only populate and query the Policy Version table for our current and
previous versions of an object. Some organizations might demand both tables for various reasons,
but it is not required to support the semantics of this pattern. There are benefits to having only one
(version) table to update, manage and query, so we will describe this pattern with this design
decision.

Version Primary Key

Having versions of the same object in the same table requires a change to the primary key of the
table. Policy number (policy_nbr) was the primary key in Patterns 1 and 2. However, policy
number alone is no longer a valid primary key now that we are storing versions in the same table.
We need something else as part of the primary key to distinguish multiple versions of the same
policy. An alternative might be to replace policy number with a surrogate key. But since surrogate
keys were not introduced as part of these earlier patterns, we will use a composite key made up of
the policy_nbr and something else to define the version.

1 of 4 10/7/2008 1:12 PM
Time and Time Again: Managing Time in Relational Databases, Part 18 - ... http://www.dmreview.com/dmdirect/2008_55/10000644-1.html?type=pri...

Figure 1: Correlated Subselect for Max (see PDF)

That “something else” could be a sequence number (version number), but for our examples we
will use the version beginning (system activity) date (ver_beg_dt). Remember, we use dates in our
examples to simplify them, but we really would use a date/timestamp to allow for multiple events
per day. The primary key will be policy_nbr and ver_beg_dt.

The current version is identified by being the row for the policy that has the highest (most recent)
version begin date. So, to find the most current version you would need to find the highest (MAX)
date for the policy, then use that date to qualify the ver_beg_dt to get the single current row. This
can be accomplished by using a correlated subselect, as shown in Figure 1. Either way, this
approach has negative performance implications in applications that read these rows. We have
seen this performance problem in implementations of this pattern, and it is often not discovered
until there is a larger quantity of history and a load is placed on the application. This issue will be
addressed and overcome in future patterns in this series, so stay tuned.

Returning a version of an object “as of” a certain point in time, while possible, is a bit more
complex. You need to find the specific row that began (beg_dt) prior to the input date, but it must
less than the next version that is higher. It is not a simple “between” predicate. This kind of query
is shown in Figure 2.

Figure 2: “As Of” (see PDF)

The SQL in the following scenarios after the product data management (PDM) design will help
demonstrate how to retrieve the various versions using an input parameter representing the as-of
date.

Remember, this pattern only works with a single dimension of time. It is the time that the event is
recorded in the computer system. We assume for this pattern that this date is both the system
activity and business effectivity date. The later patterns in this series will show us how to work with
both dimensions of time. But, for now, here is Pattern 3 – Update History.

Version Pattern 3

Figure 3: Policy – Version 3 – PDM Pattern (see PDF)

Version 3 Scenario 3.1

Figure 4: Scenario 3.1 (see PDF)

Version 3 Scenario 3.2

Figure 5: Scenario 3.2 (see PDF)

Version 3 Scenario 3.3

Figure 6: Version 3 Scenario 3.3 (see PDF)

Version 3 Follow-up

2 of 4 10/7/2008 1:12 PM
Time and Time Again: Managing Time in Relational Databases, Part 18 - ... http://www.dmreview.com/dmdirect/2008_55/10000644-1.html?type=pri...

As explained in prior PDM articles, we will expand upon the original scenarios with a follow-up
section that demonstrates additional capabilities of the pattern. This follow-up for pattern 3 shows
one of the benefits of storing the current and past versions in the same table by being able to use
the same SQL and application logic to retrieve and process both the current and previous
versions, just by supplying a different input date.

Figure 7: Version 3 Follow-up 3.4 (see PDF)

Figure 8: Version 3 Follow-up 3.5 (see PDF)

One or Two Tables?

Pattern 3 combines both the current version and the past versions into a single table. There are
pros and cons to this, both physically and logically. One of the semantic constraints of Pattern 3 is
that date represents both the business effectivity date and the system activity date. So, the date in
the computing system on which it appears is the date that the event is effective for the business.
This also implies that we cannot accurately allow retroactive or future date processing; that is, an
event cannot be dated prior to the current version, nor can it be dated in the future.

Consequently, it would not significantly impact the application if the various versions were stored
in one or two tables, as far as Pattern 3 is concerned. However, in future patterns we will show
why we combine all versions in a single table to better support bi-temporal patterns and retroactive
and future date processing.

Following is a list showing some of the pros and cons combining or separating versions into
tables.

Figure 9: Versioned Tables versus Current Plus Historical Tables (see PDF)

If you are trying to retrofit history into an existing database and application that has no history,
then you might consider creating a separate history table, populated by triggers on the original
base object table. Otherwise, for a new application we lean towards combining the versions into
the same table, possibly with partitioning, with the non-current versions.

Wrap-Up

This article has shown how to implement Version Pattern 3. This pattern allows you to easily query
any version, current or past, based on the supplied as-of date input parameter. This reduces the
number of database objects, application objects and SQL required for the application. More
importantly, we start to better and more accurately retain history that is required by governing
bodies and regulations, such as Sarbanes-Oxley (SOX), National Association of Insurance
Commissioners (NAIC) and Securities and Exchange Commission (SEC). It also helps support
auditing requirements and best practices desired by most major companies today. However, it
does not yet support accurately post-dating or pre-dating database events.

In upcoming articles, we will continue to demonstrate how to physically model these more
advanced temporal patterns, and will also present the DDL to create these tables and their related
database objects, plus the SQL needed to maintain and query the data.

Stay with us as we travel through time!

3 of 4 10/7/2008 1:12 PM
Time and Time Again: Managing Time in Relational Databases, Part 18 - ... http://www.dmreview.com/dmdirect/2008_55/10000644-1.html?type=pri...

For more information on related topics, visit the following channels:

Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

4 of 4 10/7/2008 1:12 PM
Time and Time Again: Managing Time in Relational Databases, Part 19 - ... http://www.dmreview.com/dmdirect/2008_57/10000746-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 19 - RI and
Temporal RI
Tom Johnston, Randall Weis
DM Direct, February 1, 2008

This installment should have been the continuation of our presentation of physical
implementations of the version patterns discussed thus far. It should have presented a physical
implementation of Version Patterns 4 and 5. However, because editors are just plain tough when it
comes to deadlines, and because of some personal exigencies with the authors, we are unable to
stick to this logical sequence. Instead, in this article, we will consolidate and extend some of the
things we have learned about the semantic constraints that apply to versions, and especially to
what referential integrity means when applied to versions.

In terms of the roadmap presented in Part 13, this means that we are putting Step 2 aside, for the
moment, and moving on to Step 3, the semantics and implementation of Version Pattern 6. The
first part of those semantics which we wish to discuss is what we will call "temporal referential
integrity" (temporal RI).

Join the Largest Community of MDM Experts in New York City,


October 19-21, 2008
Restructured to better suit your business needs, MDM Summit Fall 2008
offers value of experience and thought leadership. Speakers will discuss
hierarchy management, identity resolution and more. Pre-register by
August 8 for bonus savings!

A glossary is included at the end of this article. As we have been writing this series, we have
developed a set of terms which have been defined more by means of introducing and using them
than in any more formal way. Of course, the later Wittgenstein (someone whose name C. J. Date
likes to invoke) did indeed say "Meaning is use," but it is well-known (by philosophers, at least)
that the later Wittgenstein is notoriously difficult to interpret and, consequently, correspondingly
inappropriate as a source of quotations taken out of context. This glossary, like dictionaries when
they are periodically undergoing revisions, takes usage to be prescriptive of meaning, but not
identical to it. And in so doing, it and we part company with the later Wittgenstein for whom most
philosophical problems need to be dissolved, not solved.

RI and Temporal RI

The referential integrity (RI) that we all know and love is RI that the database management system
(DBMS) can enforce, based on declarations we make to it when we submit our data manipulation
language (DML). It is RI against nonversion tables. For clarity's sake, we have been calling
non-version tables "object tables," not in the object-oriented sense, but rather in the sense of
things that persist through time, things that versions are versions of. Thus, a Customer table is a
table of customers, which are instances of a certain kind of object, and which persist through time.
But a Customer Version table is not a table of customers; it is a table of chronologically
sequenced, nonoverlapping, time slices of customers.

1 of 6 10/7/2008 1:13 PM
Time and Time Again: Managing Time in Relational Databases, Part 19 - ... http://www.dmreview.com/dmdirect/2008_57/10000746-1.html?type=pri...

As we continue, we will use the term "object RI" when we think the adjective is needed for clarity,
and simply "RI" when context makes it clear that we are talking about object RI. Finally, we will
always use the term "temporal RI" when talking about RI as it applies to versions.

Object RI

Here is the definition of object RI:

If there is an object RI dependency from table Y to table X (not necessarily distinct), then no state
of the database is valid in which any row in table Y is not foreign key-linked to a row in table X. No
inserts, updates or deletes are valid that would transform the database into such a state.

The RI machinery of the DBMS insures this by carrying out the following checks. We assume
throughout this discussion, unless otherwise noted, that all foreign keys are defined as
non-nullable.

1. When inserting a row into object table Y, that row must reference a row in object table X that
already exists in the database (or that is being inserted as part of the same atomic
transaction).

2. When updating the foreign key to a row in object table X, from a row in object table Y, that
new foreign key must reference a row in X that already exists in the database (or that is
being inserted as part of the same atomic transaction).

3. When deleting a row in object table X, no rows may remain in object table Y, after that
transaction is complete, that reference that row.

With insert, update and delete actions, and two tables, we have a combination of six table/action
combinations. Notice that three of them are not mentioned as constraints that object RI must
enforce. Inserting or updating a parent row cannot affect RI, nor can deleting a child row or
updating it in any way that leaves its foreign key to its parent unchanged.

Note that for the time being, we are not concerned with the different ways that the DBMS can
enforce constraint number 3 above – those ways being a) block, b) set null and c) cascade
delete.

Temporal RI

We can now define and illustrate the enforcement of temporal RI in a way that emphasizes that
object and temporal RI are both forms of referential integrity, but with the latter bringing time into
consideration. But before we do, here is a depiction of what Version Pattern 6 looks like, using our
Client Version table as an example.

Figure 1: Version Table Schema for Version Pattern 6

This schema will be refined later on, as we begin our discussion of Version Pattern 6. It is

2 of 6 10/7/2008 1:13 PM
Time and Time Again: Managing Time in Relational Databases, Part 19 - ... http://www.dmreview.com/dmdirect/2008_57/10000746-1.html?type=pri...

presented here simply to provide a graphic image of the various time-implementing columns that
are discussed below.

So here is the definition of temporal RI, stated initially as a constraint involving two version tables
As we will see later, we must also consider temporal RI constraints, and their enforcement,
between two tables one of which is a version table and the other of which is an object table:

If there is a temporal RI dependency from version table Y to version table X (not necessarily
distinct), then no state of the database is valid in which any row in table Y is not object R- linked to
a row in table X, or in which the effectivity time period of a row in table Y is not wholly contained
within the effectivity time period of its parent row in table X. No inserts, updates or deletes are valid
that would transform a database into such a state.

Now for the enforcement of temporal RI. No existing DBMS provides the machinery to carry out
this enforcement. Instead, it will be up to us, the developers, to support temporal RI by writing our
own code, possibly using triggers, to enforce the following constraints on activity against the
database:

1. When inserting a row into version table Y, that row must reference a row in version table X
that a) already exists in the database (or that is being inserted as part of the same atomic
transaction), and b) whose effectivity time period wholly contains the effectivity time period of
the row being inserted into Y.

2. When updating the foreign key to a row in version table X, from a row in version table Y, that
new foreign key must reference a row in X that a) already exists in the database (or that is
being inserted as part of the same atomic transaction), and b) whose effectivity time period
wholly contains the effectivity time period of the row being updated in Y.

3. When deleting a row in version table X, no rows may remain in version table Y, after that
transaction is complete, that reference that row.

4. No update to a row in version table X (where a change in primary key is not considered an
update) has any effect on RI constraints, other than a change to that row's effectivity time
period. In that case, when updating the effectivity time period of an object X (which is done
by creating a new current version of X that specifies the new time period), that effectivity time
period may not be shortened as long as one or more object-RI linked rows in version table Y
have an effectivity end date beyond that of the shortened end date for object X.

Recall, from our discussions thus far, that deletions in version tables are always logical deletions.
That is, they are always done by marking the current version of an object deleted as of an
indicated date, not by physically removing a row from a table.

Informally, what is going on is this:

Temporal RI is object RI plus the constraint that the time period of a child row must always be
wholly contained with the time period of its parent row.

If we keep this in mind as we begin to look more closely at the details of implementing temporal RI
enforcement, those details will be much less likely to confuse us. All of those details exist to
support this one-sentence definition of temporal RI.

3 of 6 10/7/2008 1:13 PM
Time and Time Again: Managing Time in Relational Databases, Part 19 - ... http://www.dmreview.com/dmdirect/2008_57/10000746-1.html?type=pri...

Glossary

Child table, child row. Y is a child table to X if and only if there is a foreign key dependency from Y
to X. A row in Y is a child to a row in X if and only if the row in Y has a foreign key whose value is
identical to the primary key value of that related row in X.

1. Parent/child relationships typically have a maximum cardinality of one/many, and a minimum


cardinality of optional for the parent and required for the child. But it is a matter of which
table contains the foreign key and which table is the reference of that foreign key that
differentiates parent from child tables and rows. Cardinality constraints are not what make
the difference.

Effectivity time period. The period of time during which a version is regarded as the truth about an
object.

1. In our version patterns, an effectivity time period is defined by an effectivity begin date and
an effectivity end date, where "date" may be a calendar date or any other "tick of the clock"
(as described in Part 2).

2. Our convention is that the time period begins on the begin date, but ends one clock tick prior
to the end date.

Object referential integrity, object RI. Referential integrity for objects, as opposed to object
versions.

1. This is what the term "referential integrity" refers to, outside of these articles. It is the integrity
constraint, enforced by all current RDBMSs, that every non-null foreign key must contain a
value identical to the value in the primary key of a row in the RI-related table.

Object table. A table whose rows represent persistent objects. Sometimes called a
"non-versioned" table.

1. Persistent objects are things that exist over time and can change over time, such as vendors,
customers, employees, regulatory agencies, products, services, bills of material, invoices,
purchase orders, claims, certifications, etc.

2. From an OLAP, star-schema point of view, dimension tables are tables of persistent objects.
For a more in-depth discussion of different types of tables, see the articles "An Ontology of
Tables," at MindfulData.com.

3. From an OLTP database point of view, assuming that there are no version tables in the
database, object tables are all the tables which are not transaction tables.

4. Roughly speaking, object tables are the tables which are the concern of master data
management.

Parent table, parent row. X is a parent table to Y if and only if there is a foreign key dependency
from Y to X. A row in X is a parent to a row in Y if and only if that row in Y has a foreign key whose
value is identical to the primary key value of that row in X.

1. Parent/child relationships typically have a maximum cardinality of one/many, and a minimum

4 of 6 10/7/2008 1:13 PM
Time and Time Again: Managing Time in Relational Databases, Part 19 - ... http://www.dmreview.com/dmdirect/2008_57/10000746-1.html?type=pri...

cardinality of optional for the parent and required for the child. But it is a matter of which
table contains the foreign key and which table is the reference of that foreign key that
differentiates parent from child tables and rows. Cardinality constraints are not what make
the difference.

Temporal referential integrity, temporal RI. Referential integrity for versions, which consists of
object RI plus the constraint that the effectivity time period for child rows are wholly contained with
the effectivity time periods of their object RI-related parent rows.

Version table. A table whose rows represent versions of persistent objects.

1. A version of a persistent object is a time-slice of that object. A row in a version table


represents a version, and describes that object as it was, is, will be and/or might be during
its effectivity time period.

2. A row in a version table is what the custodians of the table believe is the truth about the
object it represents, during the indicated effectivity time period.

Wholly contained in (for effectivity time periods). Time period 2 is wholly contained in time period 1
if and only if the effectivity begin date of time period 2 is equal to or later than the effectivity begin
date of time period 1, and the effectivity end date of time period 2 is equal to or earlier than the
effectivity end date of time period 1.

1. In order to implement this constraint in today's production databases, we rely on SQL data
types and operators. Specifically, we rely on dates and/or timestamps, and the DBMS
implementation of comparison operators for those data types.

2. Also for ease of current implementation, we use "12/31/9999" in a way whose intracacies will
be discussed in Part 20. For now, we note that when we use "12/31/9999," we let the DBMS
interpret that string as a valid date. However, semantically, it almost never would be a valid
date, because no business activities that we know of are concerned with an effectivity time
period that extends up to but not including or beyond that date nearly eight-thousand years
in the future.

Wrap-Up

In Parts 10 through 12 of this series, we discussed RI as it applies to version tables. There we


concluded that when managing rows in version tables, the only RI we needed to enforce among
those rows was object RI, RI expressed by means of object identifiers (OIDs).

Clearly, if this is all that were done to check the validity of insert, update and delete transactions
against version tables, it would be possible to introduce violations of temporal RI. As we will
explain next time, our concern was to avoid the performance penalty of enforcing temporal RI on
inserts, updates and deletes to the database. This concern, indeed, is the principal reason that IT
professionals tend to avoid versioning as much as possible.

In Part 20, we will examine our reasons for taking that position, and will explain why it is still a
reasonable position to take. However, we will look more closely at the performance implications of
doing temporal RI checking on insert, update and delete transactions. Those implications may not
be as forbidding as we had originally thought them to be.

5 of 6 10/7/2008 1:13 PM
Time and Time Again: Managing Time in Relational Databases, Part 19 - ... http://www.dmreview.com/dmdirect/2008_57/10000746-1.html?type=pri...

We will also begin a discussion of the value "12/31/9999" which may appear in effectivity end
dates. In date or timestamp data types, this string is understood by the DBMS to represent an
actual date, nearly eight-thousand years in the future. But from a semantic point of view, that
string does not represent a date. It functions as a temporal {null}.

As such, we shall argue, its most mathematically secure management would require the use of a
temporal three-valued logic. But fortunately, our focus is on implementation with today's
technology. So just as {null}s in today's SQL fudge a robust three-truth-valued logic, we will show
how these "temporal {null}s" can also limp along without the support of a secure foundation in a
three-time-valued (modal) logic.

Beyond these issues, we must also discuss temporal RI as it applies to future versions, to
hypothetical versions, and also to versions which correct errors in other versions. This will take us
several installments further into this series, at which point we will resume our presentation of
physical implementations of Version Patterns 4 and 5.

After that, we will begin our discussion of Version Pattern 6.

For more information on related topics, visit the following channels:

Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

6 of 6 10/7/2008 1:13 PM
Time and Time Again: Managing Time in Relational Databases, Part 20 - ... http://www.dmreview.com/dmdirect/2008_59/10000793-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 20 - RI and
Temporal RI (Continued)
Tom Johnston, Randall Weis
DM Direct, February 15, 2008

In this installment, we will begin a discussion of how temporal referential integrity (RI) may be
enforced with today's relational database management systems (RDBMSs). In addition, we will
expand the glossary which we began last time, in Part 19. Indeed, the glossary work has proven to
be the major portion of the work done here in Part 20. This seems to be the right point in our
series to begin formalizing the definitions of the terms we have introduced.

Object RI and Temporal RI: A Clarification

Integrating Data: An Open Source Approach


Any technology deployed in the information system needs to
interoperate with existing applications or databases. Learn through
real-life scenarios how Open Source data integration solves the
interoperability challenge. Free White Paper.

In Part 19, we said that by "object RI" we meant the normal DBMS-enforceable referential integrity,
implemented by populating a column designated as a foreign key to table X with the value of one
of the primary keys in that table. That was wrong, and is indicated as such in the glossary by
shading that entry. A new corrected definition is also included.

On that basis, we went on to give the following definition of "temporal RI": If there is a temporal RI
dependency from version table Y to version table X (not necessarily distinct), then no state of the
database is valid in which any row in table Y is not object-RI linked to a row in table X, or in which
the effectivity time period of a row in table Y is not wholly contained within the effectivity time
period of its parent row in table X. No inserts, updates or deletes are valid that would transform a
database into such a state.

As this discussion should make clear, our corrected definition of the term "object RI" does not
require us to amend the above definition of "temporal RI."

Notice that in this definition, the dependent-on table (X) is itself a version table. Consequently, it
contains no rows for objects, only rows for versions of objects. To illustrate, consider this
schematic of a version table, also copied from Part 19. We also add a schematic of a dependent
table which, in our ongoing example, is the Policy table.

1 of 7 10/7/2008 1:14 PM
Time and Time Again: Managing Time in Relational Databases, Part 20 - ... http://www.dmreview.com/dmdirect/2008_59/10000793-1.html?type=pri...

Figure 1: Version Table Schema for Version Pattern 6

In the Policy table shown here, the notation (OFK) stands for object foreign key. The object
mentioned in this definition is, in this example, a specific client. The client is represented by the
OFK column client-nbr.

If the referenced table were itself an object table, and not a version table, then client-nbr would be
a "normal" foreign key, not an OFK. It would contain a value that was unique in the designated
primary key column of the referenced table, and its validity could be checked by the standard RI
mechanism of the DBMS (see Glossary). But the definition specifically states that it is a definition
that involves two version tables. Consequently, client-nbr in the referenced table, by itself, is not a
complete primary key. So it is not something the DBMS can enforce.

So what does "object RI" mean when there is no unique object to refer to, when, whatever it is, it is
not enforceable by a RDBMS? By the same token, what kind of foreign key is an OFK?

An OFK column contains the value of a unique object even though a table with that value as
primary key does not exist. It is as if, instead of Figure 1, versioning were implemented as shown
in Figure 2.

2 of 7 10/7/2008 1:14 PM
Time and Time Again: Managing Time in Relational Databases, Part 20 - ... http://www.dmreview.com/dmdirect/2008_59/10000793-1.html?type=pri...

In the as-if situation depicted in Figure 2, these are the foreign key dependencies:
Policy Version table to Policy table.
Client Version table to Client table.
Policy table to Client table.

These RI dependencies are all object RI dependencies because in all cases, the table referenced
by a foreign key is an object (a non-version) table. In addition, all these object RI dependencies
can be enforced by the RDBMS. So to keep this distinction clear, we will use the term "standard
RI" (or "RI" by itself) to mean RI among non-versioned tables. Standard RI is RI that is enforceable
by the DBMS.

This leaves us free to redefine the term "object RI." We will now use it to mean a constraint that
applies between a pair of tables, where the referenced (parent) table is a version table. In that
referenced version table, one part of its primary key will be an identifier for the object whose
version it is. In the referencing table, there is a column which is constrained to contain a value
matching the object identifier part of a version in the referenced table which is current at the time
the referencing (child) row was inserted. That column is what we are calling an OFK. In addition, if
the referencing row is not in a version table, then an update in place can be applied to it, of
course. If that happens, the object RI constraint requires that the new value put in the OFK
column must match the object identifier in a version current at the time of the update.

The Glossary below contains the indicated corrections, and also includes new expressions not
defined previously.

3 of 7 10/7/2008 1:14 PM
Time and Time Again: Managing Time in Relational Databases, Part 20 - ... http://www.dmreview.com/dmdirect/2008_59/10000793-1.html?type=pri...

Glossary

Child table, child row. Y is a child table to X if and only if there is a foreign key dependency from Y
to X. A row in Y is a child to a row in X if and only if the row in Y has a foreign key whose value is
identical to the primary key value of that related row in X.

1. Parent/child relationships typically have a maximum cardinality of one/many, and a minimum


cardinality of optional for the parent and required for the child. But it is a matter of which
table contains the foreign key and which table is the reference of that foreign key that
differentiates parent from child tables and rows. Cardinality constraints are not what make
the difference.
2. Also sometimes referred to as “dependent tables” or “dependent rows,” or various cognates
(e.g., “RI-dependent tables,” “RI-dependent rows”).

Current version, most recent version. A version of an object whose effectivity begin date is
chronologically the latest across all non-future versions of that object, and whose effectivity end
date is either unknown (represented by “12/31/9999”) or in the future.

1. Until we begin discussing Version Pattern 7, we will assume that no two versions of the same
object can have the same effectivity begin date. After all, between that date and the earliest
effectivity end date between them, that would mean that we had two different statements of
what was true about an object during that period of time. But there can only be one version
of the truth.
2. With Version Pattern 7, we will discuss cases in which two or more versions, of the same
object, might have the same effectivity begin date.

One source of such cases is when there is a business requirement to correct an error
in a version table, but also to retain, as queryable history, the version discovered to be
in error.
Another source of such cases would be where one of a few number of things might be
true of an object in a given time period, but we don’t know which one.

Effectivity time period. The period of time during which a version is regarded as the truth about an
object.

1. In our version patterns, an effectivity time period is defined by an effectivity begin date and
an effectivity end date, where "date" may be a calendar date or any other "tick of the clock"
(as described in Part 2).
2. Our convention is that the time period begins on the begin date, but ends one clock tick prior
to the end date.

Logical delete. A delete which marks the row as deleted, but leaves it in the table.

1. Also known as a “business delete,” or referred to as a “semantic delete” or a cognate.


2. See “physical delete”.

Episode. Each version Y of an object which does not supercede a version X current at the time Y
is created begins an episode of an object.

1. See “supercession.”
2. The first version of an object creates the initial episode of that object. That remains the

4 of 7 10/7/2008 1:14 PM
Time and Time Again: Managing Time in Relational Databases, Part 20 - ... http://www.dmreview.com/dmdirect/2008_59/10000793-1.html?type=pri...

current episode until a logical delete version for that object is inserted into the table.
3. If, when a version Y is inserted into the table, the most recent version of that object already
in the table is a logical delete version, then Y creates a noninitial episode of that object. That
remains the current episode until a logical delete version for that object is inserted into the
table.
4. If, when a version Y is inserted into the table, the most recent version of that object already
in the table is a current version, then Y supercedes that version but remains within the same
episode as the version it supercedes.
5. Also:
a. Current episode. An episode of an object whose most recent version is not in the future
and is not a logical delete
b. Past terminated episode, terminated episode. An episode of an object whose most
recent version is not in the future and is a logical delete
c. Future episode. An episode of an object whose most recent version is in the future and
is not a logical delete
d. Future terminated episode. An episode of an object whose most recent version is in the
future and is a logical delete.

Object foreign key (OFK). A column in a versioned table which contains the object identifier used
by one or more rows in a (not necessarily distinct) version table.

1. For example, in Figure 2 of Part 20, client-nbr in the Policy table is an OFK. It is constrained
to have the value of a client-nbr in one or more rows in the Client table, such that at least
one of those rows was current as of the time the row containing the OFK was created.

Object referential integrity, object RI (revised). The constraint that when a row containing an OFK
is inserted, the value in that OFK matches the value of the object identifier portion of a primary key
in a (not necessarily distinct) version table.

1. Object RI cannot be enforced by today's RDBMSs. This follows from the fact that OFKs
reference an object only indirectly, by means of the one or more versions that implement it.
In those versions, the referenced object identifier is only part of the primary key of its table,
and is thus not necessarily (or even usually) unique.
2. Object RI requires that the referenced (parent) table be a versioned table. But the
referencing (child) table, the one that contains the OFK, may itself be either a versioned or
an object (non-versioned) table.

Object table. A table whose rows represent persistent objects. Sometimes called a
"non-versioned" table.

1. Persistent objects are things that exist over time and can change over time, such as vendors,
customers, employees, regulatory agencies, products, services, bills of material, invoices,
purchase orders, claims, certifications, etc.
2. From an online analytical processing, star-schema point of view, dimension tables are tables
of persistent objects. For more in-depth discussion of different types of tables, see the
articles "An Ontology of Tables" at MindfulData.com.
3. From an online transaction processing database point of view, assuming that there are no
version tables in the database, object tables are all the tables that are not transaction tables.
4. Roughly speaking, object tables are the tables which are the concern of master data
management.

Parent table, parent row. X is a parent table to Y if and only if there is a foreign key dependency

5 of 7 10/7/2008 1:14 PM
Time and Time Again: Managing Time in Relational Databases, Part 20 - ... http://www.dmreview.com/dmdirect/2008_59/10000793-1.html?type=pri...

from Y to X. A row in X is a parent to a row in Y if and only if that row in Y has a foreign key whose
value is identical to the primary key value of that row in X.

1. Parent/child relationships typically have a maximum cardinality of one/many and a minimum


cardinality of optional for the parent and required for the child. But it is a matter of which
table contains the foreign key and which table is the reference of that foreign key that
differentiates parent from child tables and rows. Cardinality constraints are not what make
the difference.
2. Also sometimes referred to as “dependent-on tables,” or “dependent-on rows” various
cognates.

Physical delete. A delete which physically removes the row from its table.

1. See “logical delete.”

Queryable history. Data about an object that was valid at some time in the past, which is no longer
currently valid, but which is as easily and rapidly accessible as current data.

1. “As easily and rapidly accessible as current data” means what it says. Our way of providing
this access is to use version tables. In such tables, production queries against current data
(the most common kind of query) can be used to retrieve historical data simply by adding a
date to a BETWEEN clause of the SQL statement.
2. As we have mentioned before (and will again), providing queryable history, in this manner,
can significantly lower the development and operations cost of accessing historical data, and
significantly improve the availability of the historical data retrieved.

Standard referential integrity, standard RI. The referential integrity constraint among
non-versioned tables that today's RDBMSs can enforce. Because we are calling these
non-versioned tables "object tables," we erroneously equated object RI with standard RI in Part
19.

Supercede, supercession. In these articles, we use these terms to refer to the replacement of a
current version with a new current version.

1. Supercession is a logical function. Physically, supercession is done by inserting a new row in


a version table.
2. Deletion in a version table is done via supercession.
3. Versionable updates in a version table are also done via supercession.
4. However, creating the first version of an object does not involve supercession, as there is no
current version of that object to supercede.
5. A more subtle point: Creating the first version of a noninital episode of an object also does
not involve supercession. Even if other versions of an object exist, every episode is
concluded with a logical delete version. Thus, when a new episode is started, the most
recent prior version of that object is a logical delete version. Thus, there is no current version
of the object to supercede when that new episode begins.

Temporal referential integrity, temporal RI. Referential integrity for versions, which consists of
object RI plus the constraint that the effectivity time period for child rows are wholly contained with
the effectivity time periods of their parent rows.

Version, version table. A table whose rows represent versions of persistent objects.

6 of 7 10/7/2008 1:14 PM
Time and Time Again: Managing Time in Relational Databases, Part 20 - ... http://www.dmreview.com/dmdirect/2008_59/10000793-1.html?type=pri...

1. A version of a persistent object is a time-slice of that object. A row in a version table


represents a version, and describes that object as it was, is, will be and/or might be during
its effectivity time period.
2. A row in a version table is what the custodians of the table believe is the truth about the
object it represents during the indicated effectivity time period.

Wholly contained in (for effectivity time periods). Time period 2 is wholly contained in time period 1
if and only if the effectivity begin date of time period 2 is equal to or later than the effectivity begin
date of time period 1, and the effectivity end date of time period 2 is equal to or earlier than the
effectivity end date of time period 1.

1. In order to implement this constraint in today's production databases, we rely on SQL data
types and operators. Specifically, we rely on dates and/or timestamps, and the DBMS
implementation of comparison operators for those data types.
2. When we use "12/31/9999," we let the DBMS interpret that string as a valid date. However,
semantically, it almost never would be a valid date, because no business activities that we
know of are concerned with an effectivity time period that extends up to but not including or
beyond that date nearly eight-thousand years in the future.

Our ground-clearing that was needed before we discuss options for enforcing temporal RI is
hopefully now complete. If it is, Part 21 will begin our discussion of those options.

Beyond these issues, we must also discuss temporal RI as it applies to future versions, to
hypothetical versions and also to versions that correct errors in other versions. This will take us
several installments further into this series, at which point we will resume our presentation of
physical implementations of Version Patterns 4 and 5.

After that, we will begin our discussion of Version Patterns 6 and 7.

For more information on related topics, visit the following channels:

Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

7 of 7 10/7/2008 1:14 PM
Time and Time Again: Managing Time in Relational Databases, Part 21 - ... http://www.dmreview.com/dmdirect/2008_62/10000912-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 21 - Enforcement
Time Options for Temporal RI
Tom Johnston, Randall Weis
DM Direct, March 7, 2008

At the conclusion of our previous article, we said that there are two options for enforcing temporal
integrity constraints on a database. One is to enforce the constraints at the time of update so as to
prevent violations from being persisted in the database. The other is to let violations occur, but to
filter them out at retrieval time. This would be done by means of views against which all normal
queries would be written.

It's easy to disparage the latter option by describing it as letting temporal "garbage" onto the
database, and then trying to hide that garbage from queries. But this is indeed the approach we
have taken on temporal databases that we have designed, built, put into production and
supported, and we had good reason for doing so. Like almost all database professionals, we were
concerned that there would be an overwhelming volume of cascade updates if we tried to enforce
temporal RI at time of update, especially when the update involves the delete of a versioned object
that was the "parent" of possibly many other objects, in many other tables. (See the introduction
to Part 10 for a more complete description of this "cascade update" fear.)

Intuitive Data Reporting & Dashboards


Get instant visibility into your business with clickable access to your
data no matter where it resides. Customize reports & dashboards on
the fly.Try PivotLink now.

We believe it is time to re-examine this fear, and see if the performance hit for enforcing temporal
RI on update transactions is as great as we have feared. We assume here that for as long as data
storage continues to involve spinning disks, cylinders and mechanical movement of read/write
heads, to get to required data, we can safely simplify our performance issue and treat it as a
matter of how many physical I/Os, how many block-level transfers between a storage device (or its
cache) and main memory, are required to complete a temporal RI validation. With this in mind, let
us now examine two things – the logic of transactions against versioned objects, and the I/O that
is required for those transactions.

Temporal RI Reformulated

To more explicitly allow for the query-time enforcement of temporal integrity constraints, we will
slightly amend the definition of temporal RI given in Part 20, as follows:

If there is a temporal RI dependency from table Y to version table X (not necessarily distinct), then
no exposed state of the database is valid in which any row in table Y is not object-RI linked to a
row in table X, or in which the effectivity time period of a row in table Y is not wholly contained
within the effectivity time period of its parent row in table X. No queries are valid which would

1 of 5 10/7/2008 1:14 PM
Time and Time Again: Managing Time in Relational Databases, Part 21 - ... http://www.dmreview.com/dmdirect/2008_62/10000912-1.html?type=pri...

expose such a state.

There are three changes to the original definition:

1. "Version table Y" changed to "table Y." As explained last time, we allow both versioned and
non-versioned tables to be object-RI dependent on a versioned table.
2. "State of the database" changed to "exposed state of the database." This allows for the
option of enforcing temporal RI on queries rather than on updates.
3. "No inserts, updates or deletes are valid that would transform a database into such a state"
replaced by "No queries are valid that would expose such a state." The replaced sentence
was appropriate for update-enforced temporal RI. The replacing sentence is appropriate for
either update- or query-enforced temporal RI.

We will assume, in the following discussion, that we are dealing with update-enforced temporal RI.
In later articles, we will discuss query-time enforcement.

Note: before proceeding, please note that the glossary introduced a couple of installments ago is
not included in this article, nor will it be included in future articles. The reason is that the glossary
is already over 3,000 words long and likely to grow to several times that size.

One reason for the glossary being this extensive is that we are making the definitions of its entries
extensive. Most of these concepts are complex, and we don't think it is possible to do them justice
with one- or two-sentence definitions.

Another reason for the glossary being this extensive is that we are attempting to construct it as a
controlled vocabulary. By that we mean that any expression involving temporal concepts which is
used in a definition must itself be a defined glossary entry. This is the kind of rigor that must be
applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like
predicate logic and thus available for manipulation by software-realized inferencing engines. It is
also the kind of rigor that must be applied (to any set of technical definitions) before we can say
with confidence that we know what we mean.

Of course, all this just makes the glossary that much more important than if its entries were brief
and informal. So we have decided to put the glossary on each of our Web sites, to keep it updated
there, and to include a link on our respective homepages to that glossary. The URLs are
MindfulData.com and InbaseInc.com.

In this review of updates to versioned tables, our approach will be to first describe the logic of
those updates, including the enforcement of temporal constraints. After that, we will go on to
consider the physical I/O costs of enforcing those constraints on those updates. The specific types
of update that we will consider are:

Deletes;
Inserts;
(Row-level) updates; and
Upserts.

We begin with some terminology. An "original" transaction is a transaction created or authorized


by the business. A "temporal" transaction is the result of translating an original transaction into a
transaction against a versioned object. As we will see, while each original transaction appears to
its author to be an update against a single row of an object table, the translation may result in a

2 of 5 10/7/2008 1:14 PM
Time and Time Again: Managing Time in Relational Databases, Part 21 - ... http://www.dmreview.com/dmdirect/2008_62/10000912-1.html?type=pri...

set of updates to multiple rows in possibly multiple tables. (The original/temporal distinction, and
other key terms of our temporal data management approach, are defined in the glossary.)

Original Deletes: Their Semantics

To the business, when they are issuing a delete against an object, they don't care if the object is
versioned or not, and indeed may not know. They are directing us to remove the object from the
database. Thus, an original delete transaction doesn't need to specify a version, just the object
being deleted and the date the delete is to take effect. This is illustrated in Figure 1.

Figure 1: An Original Delete Transaction

Let's organize this discussion around three questions.

1. First, when an object is versioned, and a business user issues or authorizes a delete, what
does he think he is deleting, an object or a version?
2. Second, when an object is versioned, and a business user issues or authorizes a delete,
what is actually deleted?
3. And finally, how is the actual delete transaction carried out?

As for the first question, the author of the delete transaction intends to delete an object, and
believes he is doing so. For example, he may issue a delete of a specific client, against a database
in which clients are versioned, but he does not issue a directive to delete a specific version of that
client.

The reason that original update transactions are specified against objects, and not against
versions, is that the burden of managing versions should not be placed on the business user, in
this case on the author of the delete transaction. Versions are simply an internal mechanism for
satisfying the business demand for real-time access to noncurrent data, which is just as fast and
easy to access as current data. As far as possible, versions and their complexities should be
hidden from both end users and developers.

A delete transaction against a versioned object is valid if and only if there is a current version of
that object on the database at the time of the transaction. Necessarily, this version will always be
the one current version of the one current episode of that object, and applying the delete will
delete the object by terminating its current episode.

But what about past versions, future-dated versions or hypothetical versions? As for past versions,
it is reasonable to expose semantics to the user that prohibits them from updating the past. Just
as with good accounting practice, if what we recorded about the past was wrong, we do not
update that original entry. Instead, we enter adjusting transactions.

As for future or hypothetical versions, it could be said that if we allow users to create them, as we
in fact do, then we expose the mechanism of versioning to those users. But this isn't so.
Consistent with our principle of hiding the mechanism of versioning as much as possible, we do
not expose to the user or developer the concept of deleting something that doesn't yet exist.

3 of 5 10/7/2008 1:14 PM
Time and Time Again: Managing Time in Relational Databases, Part 21 - ... http://www.dmreview.com/dmdirect/2008_62/10000912-1.html?type=pri...

Instead, all that we have exposed to the user is functionality, the ability to insert transactions in
advance of when they are to become effective, or even transactions that may never become
effective. The mechanisms by which we do this are not themselves exposed. From the user's point
of view, it appears as if we have simply accepted transactions in advance of when they should be
applied and stored them somewhere in a transaction file, with the promise that we will apply them
precisely on the date they are to become effective.

Temporal Deletes: Their Targets

So, given that there must be a current version of the object of an original delete, and there may
be, in addition, past, future and hypothetical versions as well, what is actually deleted? This is
determined by the code that transforms an original delete into a temporal delete. We may think of
this code as the "transform" portion of an extract, transform and load (ETL) process against a
batch file of original transactions. We may also think of this code as a stored procedure invoked
against individual online transactions.

In either case, the important thing is that business users, developers and database administrators
(DBAs) all work with original transactions and not with temporal transactions. The code that
translates original to temporal transactions encapsulates the complexity of versions and their
manipulations, supporting their functional semantics without exposing their implementation
complexities.

This question about the targets of temporal transactions is the second of the three questions in
the previous section. The answer is: the current version, and any future and hypothetical versions
as well. To the user, it will seem that we have done three things:

Deleted the one row representing the object in question;


Flushed any remaining transactions for that object from the transaction file just mentioned;
and
If there were hypothetical versions, removed that object from one or more "hypothetical
situation" databases.

But how is this done? What physically happens when the object of an original delete is a
versioned object? The answer is that this is done by inserting a new version that supercedes the
current version of the object being deleted, as well as a new version for each of the future or
hypothetical versions of that object.

The process begins by finding the current version of the object specified on the original
transaction. If it is not found, the transaction is invalid. Otherwise, a superceding delete version is
created which has the same object identifier and business effectivity begin date as the version it is
superceding.

The effectivity end date on the superceding delete version is set to the delete date on the
transaction. This is because an original delete is a business transaction. Its intent is to remove the
object from the database as of the indicated date. For example, if a client is being deleted, the
intent is that after the transaction is complete, that client no longer exists, i.e., that person or
organization no longer exists as a client of our company. (We may assume that if a delete date is
not supplied on the original transaction, the effectivity end date defaults to the date current when
the transaction is physically applied to the database.)

On these versions created by an original delete transaction, a delete flag is set to "yes." For all but

4 of 5 10/7/2008 1:14 PM
Time and Time Again: Managing Time in Relational Databases, Part 21 - ... http://www.dmreview.com/dmdirect/2008_62/10000912-1.html?type=pri...

these delete-created versions, this flag is set to, and remains, "no." The purpose of this delete flag
is to distinguish deletes that happen in response to an original delete transaction from deletes that
happen because a non-12/31/9999 end date comes to pass.

Note that nowhere in this process have we discussed a physical deletion of a physical row in a
version table. Physically removing rows from versioned tables is a process we will call "archiving"
those rows, and it is outside the scope of what this series of articles is about. See the taxonomy in
Part 1 for additional discussion of the difference between archiving, creating snapshots, and
versioning.

Wrap-Up

Next time, we will finish our discussion of original and temporal delete transactions by examining
the foreign key implications of deleting a versioned object. We will then be in a position to access
the risk and the extent of potential "cascade" effects of such deletions. Since what we are
proposing is an enterprise solution, we must either evaluate performance implications against
worst-case scenarios, or else design our original transactions with a flag that can selectively turn
on or off transaction-time temporal validation checks.

For more information on related topics, visit the following channels:

Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

5 of 5 10/7/2008 1:14 PM
Time and Time Again: Managing Time in Relational Databases - Part 22: ... http://www.dmreview.com/dmdirect/2008_64/10000972-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases - Part 22: Original and
Temporal Deletes
Tom Johnston, Randall Weis
DM Direct, March 21, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the
authors. The URLs are MindfulData.com and InbaseInc.com.

This glossary is being constructed as a controlled vocabulary. By that we mean that any expression
involving temporal concepts which is used in the definition of a glossary entry must itself be a
glossary entry.

Intuitive Data Reporting & Dashboards


Get instant visibility into your business with clickable access to your
data no matter where it resides. Customize reports & dashboards on
the fly.Try PivotLink now.

One reason for being this careful with definitions is that this kind of rigor must be applied to
definitions before they can be "ontologized," i.e. expressed in a formal notation like predicate logic.
That, in turn, is what is needed to make definitions available for manipulation by software-realized
inferencing engines.

Another reason for being this careful with definitions is that this is the kind of rigor that must be
applied to any set of technical definitions before we can say with confidence that we know what we
mean, and that we mean what we say, no more and no less.

In our previous article, we introduced the distinction between original and temporal transactions,
and began by discussing original and temporal deletes. This article completes our discussion of
delete transactions. We will then go on to discuss insert, update and upsert transactions.

These discussions describe the transformations required to map original transactions onto
temporal transactions. This mapping is the "database update side" of the encapsulation of the
complexities of temporal data management which, in Part 1, we said was an essential part of our
approach.

These same discussions also provide the insight necessary to decide whether we should attempt
to enforce temporal referential integrity (RI) on the updates themselves, or instead defer that
integrity checking until data is retrieved. With respect to the deferral option, we will repeat what we
said about it in the previous article:

It's easy to disparage [the deferral option] by describing it as letting temporal "garbage" onto the
database, and then trying to hide that garbage from queries. But this is indeed the approach we
have taken on temporal databases that we have designed, built, put into production and

1 of 5 10/7/2008 1:15 PM
Time and Time Again: Managing Time in Relational Databases - Part 22: ... http://www.dmreview.com/dmdirect/2008_64/10000972-1.html?type=pri...

supported, and we had good reason for doing so. Like almost all database professionals, we were
concerned that there would be an overwhelming volume of cascade updates if we tried to enforce
temporal RI at time of update, especially when the update involves the delete of a versioned object
that was the "parent" of possibly many other objects, in many other tables. (See the introduction
to Part 10 for a more complete description of this "cascade update" fear.)

From the Previous Article

For the sake of continuity, we repeat the following material from the previous article.

If there is a temporal RI dependency from table Y to version table X (not necessarily distinct), then
no exposed state of the database is valid in which any row in table Y is not object-RI linked to a
row in table X, or in which the effectivity time period of a row in table Y is not wholly contained
within the effectivity time period of its parent row in table X. No queries are valid which would
expose such a state.

An "original" transaction is a transaction created or authorized by the business. A "temporal"


transaction is the result of translating an original transaction into a transaction against a versioned
object.

To the business, when they are issuing a delete against an object, they don't care if the object is
versioned or not, and indeed may not know. They are directing us to remove the object from the
database. Thus, an original delete transaction doesn't need to specify a version, just the object
being deleted and the date the delete is to take effect. This is illustrated in Figure 1.

Figure 1: An Original Delete Transaction

Deleting a Versioned Object: The Existence Constraint

The original transaction we are considering is a delete. Just as with a delete to a non-versioned
object, the first thing that must be done is to find the target row(s). One target row must be the
most recent row of the current episode of the object. The original transaction provides the
electronic identifier (EID) for the object. The most recent version of the current episode of the
object is the one (nonhypothetical) version of that object with an effectivity begin date in the past
and an effectivity end date either in the future, or containing the "temporal null" value "12/31
/9999."

First of all, note the assumption that there cannot be more than one current version of an object in
a versioned table. We recommend that you attempt to state, clearly and precisely, what
guarantees that this assumption is valid. (Hint: Consider the logic of inserting into a versioned
table. In the next article, this logic will be described in terms of the mapping from an original insert
to a temporal insert, but in both the glossary and in previous articles, the material already exists
for you to answer this question.)

If a current version of the indicated object is found, the delete can proceed. Otherwise, it is
rejected as invalid. (We consider deletes which are explicitly against future versions in a later
article. For now, we assume that all original transactions are against current versions.)

2 of 5 10/7/2008 1:15 PM
Time and Time Again: Managing Time in Relational Databases - Part 22: ... http://www.dmreview.com/dmdirect/2008_64/10000972-1.html?type=pri...

As described last time, an original delete of a versioned object requires us to supercede the most
recent version of the current episode of the object with a delete version, and also to do the same
for any active future (or hypothetical) episodes of that object. An active future episode is a future
episode whose most recent version is not a delete version.

Deleting a Versioned Object: What is a Dependent Row?

If the version is found, the next thing is to look for any rows, in either object tables or version
tables, that are dependent on it. This dependency is precisely that of temporal RI.

Recall as well that the database management systems (DBMS) cannot help us here. If we are to
enforce temporal RI on delete transactions, we must do the work ourselves. And since this is work
to enforce a database constraint, it does not belong anywhere in application-specific code. It must
be implemented in the codebase that maps original transactions to temporal transactions. (This
codebase, of course, may be expressed in part, or even in its entirety, as DBMS triggers.)

If the child table is an object table, a dependent row is one with the to-be-deleted row's EID as an
object foreign key (OFK). Performance considerations would thus suggest that OFK columns, in
all non-versioned tables, have a non-unique index defined on them.

If the child table is a versioned table, a dependent row in a current episode of its object is one
which has an OFK that contains the value of the to-be-deleted row's EID, and with an effectivity
end date either in the future, or else containing the value "12/31/9999". A dependent row in an
active future episode of its object is the most recent version of that episode. (Until further notice,
we will not refer to hypothetical objects, versions or episodes any longer. The approach to
temporal data management that we are describing does accommodate such hypotheticals, but
they require special treatment. They will be discussed in later articles.) In short, dependent rows
for a temporal delete are all and only those rows which are OFK-related to the to-be-deleted row,
and which are the most recent versions in the active episodes for their object.

Here, performance could be improved by using a non-unique unique index on OFK plus effectivity
begin date (plus one more date, as we shall see when we discuss Version Pattern 7) in the
dependent table. However, at this point we are getting into a detailed discussion of implementation
options. We intend to discuss such issues in depth toward the end of this series, and so we will
say no more about them here.

Deleting a Versioned Object: Block, Set Null and Cascade Options

The search for dependent rows is conducted under one of three directives. They are the familiar
relational delete options of:

Block,
Set null, or
Cascade.

From a performance point of view, which, as we stated last time, means from a physical I/O point
of view, the block option is often less costly than the other options. This is because if there are
multiple rows dependent on the to-be-deleted current version, we can stop checking as soon as
the first one is found.

3 of 5 10/7/2008 1:15 PM
Time and Time Again: Managing Time in Relational Databases - Part 22: ... http://www.dmreview.com/dmdirect/2008_64/10000972-1.html?type=pri...

But if either the set null or the cascade option is chosen, then our search for temporally RI
dependent rows in versioned child tables must be exhaustive. It must retrieve and examine all
OFK-related rows.

Notice that even if the child table index used for the search is as described above, the rows
themselves must be retrieved, because the check must look at the effectivity end date of the child
rows. In very large tables, possibly with multiple OFKs, the cost of such indexes can be high.
When an important to-be-deleted object has many dependent children, the I/O costs of a)
searching both the index and the base table for possible episodes to terminate (i.e., possible most
recent versions to supercede with temporal delete versions), and b) doing the appropriate writes to
the database to terminate those episodes, may be prohibitively high.

But these are simply the trade-offs that any database administrator (DBA) must consider when
physicalizing any data model. It is the cumulative weight of such costs that can lead us to decide
to postpone temporal RI checking until retrieval time. In the case of temporal deletes, it would
mean that we would just terminate all active episodes of the designated object, and not check for
any dependencies.

Deleting a Versioned Object: The Foreign Key Ripple Effect

We know that normal, DBMS-enforced RI doesn't just link child tables to parent tables. When
using the set null or cascade options, we must also consider child tables to those child tables, and
continue down the RI-dependency chain as far as it extends. Thus, in a situation where we have
clients, policies and claims, a cascade or set null delete of a client may require one or more
policies to be set null or deleted. And if any are, the DBMS must go on to set null or delete the
claims dependent on those policies.

Let's call this the "foreign key ripple effect." There are three things to note about this ripple effect
as it applies to versioned objects.

1. It does apply to versioned objects.


2. The DBMS can't help us. We must write our own code to enforce temporal RI on temporal
delete transactions.
3. The performance cost will usually be much higher than for normal, DBMS-enforced RI.

So if we are enforcing temporal RI as the database is updated, the index and base-table lookups
that we described above must be performed. In addition, as we just pointed out, they must be
performed on all tables in the temporal RI chain of dependencies, thus on not just children of the
to-be-deleted parent row, but also of grandchildren, great-grandchildren, etc.

What of the claim that performance costs will usually be much higher than for normal RI? A
detailed defense of the claim, of course, cannot be provided unless we describe a specific
implementation. But this would not prove that some better implementation might keep down the
incremental performance costs. However, regardless of the specific implementation, they all share
the following characteristic: there are going to be a lot more rows in a versioned table than there
would be in the corresponding object table. The reason is that in versioned tables, it is not just
inserts that create new rows. Versionable updates do also, and versioned deletes do also.

Deleting a Versioned Object: Temporal Delete Transactions

Let's suppose that we have identified both the most recent version of the current episode of the

4 of 5 10/7/2008 1:15 PM
Time and Time Again: Managing Time in Relational Databases - Part 22: ... http://www.dmreview.com/dmdirect/2008_64/10000972-1.html?type=pri...

object specified on the original delete, the most recent version of any active future episodes of that
object, and also the most recent version of all active episodes in all tables along the temporal
RI-dependency chain. We then add a temporal delete transaction to each of these episodes,
superceding the most recent version in each of them with a temporal delete version.

Note that these temporal deletes do not flag and rewrite these versions. That would be to do an
update in place logical delete. As we explained in an earlier article, update in place logical deletes
lose information, and that information may be considered versionable information by the business.

Temporal deletes insert a temporal delete version for each active episode. The effectivity end date
on the delete version is set to the delete date on the transaction. This is because an original delete
is a business transaction. Its intent is to remove the object from the database, as of the indicated
date. For example, if a client is being deleted, the intent is that after the transaction is complete,
that client no longer exists, i.e., that person or organization no longer exists as a client of our
company. (We may assume that if a delete date is not supplied on the original transaction, the
effectivity end date defaults to the date current when the transaction is physically applied to the
database.)

Note also that nowhere in this process have we discussed a physical deletion of a physical row in
a versioned table. Physically removing rows from versioned tables is a process we will call
"archiving" those rows, and it is outside the scope of what this series of articles is about. See the
taxonomy in Part 1 for additional discussion of the difference between archiving, creating
snapshots, and versioning.

So just like original inserts and original (versionable) updates, original deletes to versioned objects
become physical inserts or one or more rows.

Wrap-Up

There is one unfinished piece of business. Having argued that sometimes it is too performance
costly to enforce temporal RI on original delete transactions against versioned objects, we should
now show how that same RI is enforced if it is, instead, enforced at retrieval time. However, with
deadlines approaching on our bill-paying projects, we are forced to postpone that discussion for a
later time.

Next time, we will provide the same treatment for original insert transactions against versioned
objects.

For more information on related topics, visit the following channels:

Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

5 of 5 10/7/2008 1:15 PM
Time and Time Again: Managing Time in Relational Databases, Part 23 - ... http://www.dmreview.com/dmdirect/2008_66/10001077-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 23 - Original and
Temporal Updates
Tom Johnston, Randall Weis
DM Direct, April 4, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the
authors. The URLs are MindfulData.com and InbaseInc.com.

This glossary is being constructed as a controlled vocabulary. By that we mean that any expression
involving temporal concepts used in the definition of a glossary entry must itself be a glossary
entry.

Intuitive Data Reporting & Dashboards


Get instant visibility into your business with clickable access to your
data no matter where it resides. Customize reports & dashboards on
the fly.Try PivotLink now.

One reason for being this careful with definitions is that this kind of rigor must be applied to
definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order
logic. That, in turn, is needed to make definitions available for manipulation by software-realized
inferencing engines.

Another reason for being this careful with definitions is that this rigor must be applied to any set of
technical definitions before we can claim not only that we say what we mean, but also that we know
what we mean when we say it.

In our previous three articles, we introduced the distinction between original and temporal
transactions. We then reviewed temporal referential integrity (RI) rules as they apply to temporal
deletes. This article continues the discussion by reviewing temporal RI rules as they apply to
temporal updates.

These same discussions also provide the insight necessary to decide whether we should attempt
to enforce temporal RI on the updates themselves, or instead defer that integrity checking until
data is retrieved. With respect to the deferral option, we will repeat what we said about it in the
previous article:

It's easy to disparage [the deferral option] by describing it as letting temporal "garbage" onto the
database, and then trying to hide that garbage from queries. But this is indeed the approach we
have taken on temporal databases that we have designed, built, put into production and
supported, and we had good reason for doing so. Like almost all database professionals, we were
concerned that there would be an overwhelming volume of cascade updates if we tried to enforce
temporal RI at time of update, especially when the update involves the delete of a versioned object
that was the "parent" of possibly many other objects, in many other tables. (See the introduction

1 of 4 10/7/2008 1:16 PM
Time and Time Again: Managing Time in Relational Databases, Part 23 - ... http://www.dmreview.com/dmdirect/2008_66/10001077-1.html?type=pri...

to Part 10 for a more complete description of this "cascade update" fear.)

For the sake of continuity, we repeat the following definitions which are especially important for
understanding the material in this article. These, and other technical terms, are all defined in the
glossary.

Temporal Referential Integrity: Definition

If there is a temporal RI dependency from table Y to version table X (not necessarily distinct), then
no exposed state of the database is valid in which any row in table Y is not object-RI linked to a
row in table X, or in which the effectivity time period of a row in table Y is not wholly contained
within the effectivity time period of its parent row in table X. No queries are valid which would
expose such a state.

Original and Temporal Transactions: Definitions

An "original" transaction is a transaction against a versioned object.

A "temporal" transaction is the result of translating an original transaction into one or more
transactions against one or more episodes of that versioned object.

Updating a Versioned Object: the Original Update Transaction

To the business, when they are issuing an update against an object, they don't care if the object is
versioned or not, and, indeed, they may not know. They are directing us to apply the specified
change to the object. Thus, an original update transaction doesn't need to specify a version, just
the object being updated and the date the update is to take effect. This is illustrated in Figure 1.

As we have pointed out before, we are using dates throughout these articles only for convenience.
The approach to versioning that we are presenting here applies no matter what the granularity of
the clock ticks that measure out time. The granularity used here would be the correct granularity
only for that updates take place in batch mode and are applied at most once per day. As soon as
online transactional updates are involved, however, we would need a finer granularity, such as a
full timestamp.

Figure 1: An Original Update Transaction

Update transactions can be submitted as either complete images of the updated row or else by
specifying first the row, and then the column/value pairs for each column to be updated in that
row. If these transactions are to become effective exactly when they are applied, the effectivity
begin date in Figure 1 is not required. In those cases, an original update transaction against
versioned data will be identical to the same update submitted against non-versioned data.

Updating a Versioned Object: the Existence Constraint

The original transaction we are considering is an update. Just as with an update to a


non-versioned object, the first thing that must be done is to find the target row(s). Given that the
target table is a versioned table, one target row must be the most recent version of the current

2 of 4 10/7/2008 1:16 PM
Time and Time Again: Managing Time in Relational Databases, Part 23 - ... http://www.dmreview.com/dmdirect/2008_66/10001077-1.html?type=pri...

episode of the object. The original transaction provides the EID for the object. The most recent
version of the current episode of the object is the one version of that object with an effectivity
begin date in the past and an effectivity end date either in the future, or containing the "temporal
null" value "12/31/9999."

First of all, note the assumption that there cannot be more than one current version of an object in
a versioned table. This assumption is guaranteed by the logic of inserts and deletes against
versioned objects. This reflects our commonsense belief that at any point in time, there cannot be
two different things that are true about an object.

When a user submits an update transaction against a non-versioned table, she is asserting that
the indicated object already exists in that table. If the object doesn’t already exist, the transaction
is invalid. When a corresponding original transaction is submitted, i.e., when the update will be
applied to a versioned table, the update logic must also ensure that the object already exists in
that table. Specifically, the update logic must ensure that the object exists right now. Just as with
non-versioned tables, it's not enough that the object may have existed in the table at some point in
the past or will exist at some point in the future. It must exist right now. There must also be only
one row in the table representing that object. In a versioned table, that means there must be only
one current version of the object, where a current version is one whose effectivity begin date is in
the past and whose effectivity end date is not.

If a current version of the indicated object is found, the update can proceed. Otherwise, it is
rejected as invalid. (We consider updates which are explicitly against future versions in a later
article. For now, we assume that all original transactions are against current versions.)

An original update of a versioned object requires us to supercede the most recent version of the
current episode of the object (which is, by definition, the current version of the object) with an
updated version, and also to do the same for any active future episodes of that object. An active
future episode is a future episode whose most recent version is not a delete version.

Updating a Versioned Object: Temporal Update Transactions

Let's suppose that we have identified the most recent version of the current episode of the object
specified on the original update, and also the most recent versions of any active future episodes of
that object. We then add a temporal update transaction to each of these episodes, superceding
the most recent version in each of them with a new most recent version.

Note that these temporal updates do not flag and rewrite these versions. That would be to do an
update in place, which defeats the whole purpose of versioning by overwriting the past.

Temporal updates insert a new most recent version for each active episode. The effectivity end
date on the pre-update most current version is set to one click tick prior to the effectivity begin date
on the version that is superceding it.

This is because an original update is a business transaction. Its intent is to place the updated
information in the database, as of the indicated date. For example, if a client is being updated, the
intent is that after the transaction is complete, the current description of that client is different than
it was before the update, reflecting new information we have come across about the current
description of that client.

But what about temporal RI? If we supercede a version of a client, what about the policies for that

3 of 4 10/7/2008 1:16 PM
Time and Time Again: Managing Time in Relational Databases, Part 23 - ... http://www.dmreview.com/dmdirect/2008_66/10001077-1.html?type=pri...

client that are temporally RI dependent on the client? And what about the claims for those
policies that are temporally RI dependent on them?

Updating a Versioned Object: the Concern

As we’ve mentioned before, a pervasive fear among business IT professionals is that temporal RI
enforcement on updates raises the likelihood of an extensive cascade effect, one we have called
the “foreign key ripple effect” in other articles.

We know that normal, database management systems (DBMS)-enforced RI doesn't just link child
tables to parent tables. When using the set null or cascade options, we must also consider child
tables to those child tables, and continue down the RI-dependency chain as far as it extends.
Thus, in a situation where we have clients, policies and claims, a cascade or set null delete of a
client may require one or more policies to be set null or deleted. And if any are, the DBMS must go
on to set null or delete the claims dependent on those policies.

If these RI checks must be done on normal tables whenever an object is deleted, don't temporal
RI checks need to be done on versioned tables whenever a version of an object is superceded?

Contrary to a widespread belief among data management professionals, the answer is that they
do not need to be done whenever a version of an object is superceded. To understand why this
fear is largely unfounded, note first of all that we implement temporal RI with object foreign keys
(OFKs), not with normal DBMS-enforced FKs. OFKs in a temporally dependent table do not point
to a specific parent row in the versioned table they are dependent on. They point only to the object
they are dependent on. Thus, in a Policy table, each row has an OFK to the client whose policy
they are. But the table the Policy table is temporally RI dependent on is a versioned table, and
thus an OFK does not pick out a specific row in that table.

Initially, we might want to conclude that temporal updates have no effect on temporal RI at all, and
that just like with normal RI, a temporal update does not require any RI checking at all. However,
this is not so. Most temporal updates have no effect on temporal RI, and thus incur no RI-related
performance penalties. But there is one kind of temporal update that does require checking of
dependent tables, and that may involve a lengthy chain of dependencies that must be checked so
that the temporal update can either be blocked, or trigger a cascade of temporal updates, or
trigger a cascade of temporal set nulls to the OFKs in the chain.

We will examine this temporal RI-triggering original update in Part 24. In the meantime, we
recommend that our readers think about temporal updates, and try to figure out for themselves
what kinds of temporal updates can trigger a temporal RI cascade and what kinds cannot.

For more information on related topics, visit the following channels:

Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

4 of 4 10/7/2008 1:16 PM
Time and Time Again - Managing Time in Relational Databases, Part 24: ... http://www.dmreview.com/dmdirect/2008_68/10001130-1.html?type=pri...

Time and Time Again - Managing Time in


Relational Databases, Part 24: Original and
Temporal Updates (Concluded)
Tom Johnston, Randall Weis
DM Direct, April 18, 2008

Note: A glossary of technical terms used in these articles can be found on the websites of the
authors. The URLs are MindfulData.com and InbaseInc.com.

This glossary is being constructed as a controlled vocabulary. By that we mean that any expression
involving temporal concepts which is used in the definition of a glossary entry must itself be a
glossary entry.

Join the Largest Community of MDM Experts in New York City,


October 19-21, 2008
Restructured to better suit your business needs, MDM Summit Fall 2008
offers value of experience and thought leadership. Speakers will discuss
hierarchy management, identity resolution and more. Pre-register by
August 8 for bonus savings!

One reason for being this careful with definitions is that this kind of rigor must be applied to
definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order
logic. That, in turn, is what is needed to make definitions available for manipulation by software-
realized inferencing engines.

Another reason for being this careful with definitions is that this is the kind of rigor that must be
applied to any set of technical definitions before we can claim not only that we say what we mean,
but also that we know what we mean when we say it.

As this series continues, context becomes increasingly important so that the thread of the
discussion is not lost. We begin by updating our chart of installments, which last appeared in Part
13.

(See PDF at the end of this article for a chart of installments to date in this series.)

With Part 19, we interrupted our presentation of the physical implementation of the versioning
patterns that we developed in earlier articles in the series and began a discussion of temporal
referential integrity (TRI). On completing Part 23, we had introduced and defined TRI and related
concepts, and discussed TRI as it applies to deletes and updates. We now turn to TRI as it
applies to inserts.

Updating a Versioned Object: When TRI Checking Does Not Come Into Play

Last time, we described a temporal update in which a current version of a client was superceded
by a new current version in a client version table. We also alluded to a TRI dependent policy
version table, in which every policy was owned by exactly one client. So if the client version that

1 of 5 10/7/2008 1:17 PM
Time and Time Again - Managing Time in Relational Databases, Part 24: ... http://www.dmreview.com/dmdirect/2008_68/10001130-1.html?type=pri...

was superceded owned one or more policies, doesn't that mean that we must either block the
supercession, set null the references in the policies, or else cascade update them to point to the
new version of the client?

To raise the question is to begin to see the outline of the answer to it. Those policies –versioned or
not – do not point to the superceded client version, because versioned tables are not referenced
by foreign keys; they are referenced only by object foreign keys (OFKs), which point to the
referenced object but do not point to any specific version of it.

Let's think about the situation in terms of what is going on in the real world, not simply in terms of
what is going on in the database. In the real world, policies are owned by clients. As those policies
change over time, or as the client that owns them changes over time, those claims are still owned
by those clients. This last sentence, translated into the language of database activity against
versioned tables, reads like this: even as versions of those policies are superceded by newer
versions, or as versions of the client that owns them are superceded by newer versions, those
policies are still owned by those clients. Versioning of either or both objects has no effect on the
ownership relationship between them.

But because we are dealing with a temporally qualified relationship, there is a proviso. Versioning
of either or both objects has no effect on the ownership relationship between them, provided that
the superceding version of a TRI child (parent) object has an effectivity period that is wholly
contained within (wholly contains) the effectivity period of all objects it is a TRI child of (a TRI
parent of).

Note that this proviso does not distinguish between TRI-mandatory and TRI-optional relationships.
It clearly applies to mandatory relationships, but at this time we are not sure whether or not it
should also apply to optional ones.

It's time now to see what the "provided that" clause entails.

Updating a Versioned Object: When TRI-Checking Does Come Into Play

Suppose we have a policy whose effectivity period is 1/1/2007 through 12/31/2007, and an owning
client whose effectivity period is 1/1/2004 through 12/31/9999. Although "12/31/9999" means "until
further notice," it is treated by the database management systems as a real date. Thus, the client's
effectivity period wholly contains that of the policy. This is shown in Figure 2.

Figure 1: A Client and Her Policy

The graphical conventions used in this illustration are:

1. The vertical bars constitute a timeline, with each bar representing one month. The timeline
runs from Jan 1, 2004 through Dec 31, 2009.

2 of 5 10/7/2008 1:17 PM
Time and Time Again - Managing Time in Relational Databases, Part 24: ... http://www.dmreview.com/dmdirect/2008_68/10001130-1.html?type=pri...

2. A client and a policy episode are shown, each starting on its effectivity begin date.
3. The client episode contains four versions, the policy episode three.
All versions begin with left-most vertical bars.
Adjacent versions share a vertical bar.
Versions with a known effectivity end date have a right-most vertical bar indicating that
end date. Thus, the most recent version of the policy episode has a 12/31/2007
effectivity end date.
Versions whose effectivity end date is not known have an ellipsis in place of the
right-most vertical bar and always have an end date value of "12/31/9999."
We will call the former kind of episode "closed," and the latter kind "open".
4. {Now}, relative to the example, is indicated by the black box arrow, pointing to March 1, 2007.
So, for purposes of discussing the example, we should assume that it is now March 1, 2007.
5. The "OFK" labeled crow's foot notation indicates a TRI relationship from clients to policies.
Note in particular what this is not a relationship between. It is not a relationship between:
A specific client and a specific policy.
A specific version of the client and a specific version of the policy.
A specific episode of the client and a specific episode of the policy.
Just as a relationship type is defined on a pair of conventional tables, a temporal relationship
type is defined on a pair of tables, at least one of which is temporal.

As long as these effectivity periods do not change, and the policy continues to have an OFK to its
owning client, TRI constraints between these two rows remain satisfied. Thus, no updates that
leave this OFK and these pairs of dates alone can violate TRI, and therefore no versioning caused
by these updates needs to be accompanied by TRI checks. Consequently, there is no TRI
performance penalty in such cases.

Suppose that it is now some time during 2007, say March 1, 2007. And suppose that the client's
effectivity end date is moved back to 5/1/2007. If no other changes are made to that client, she will
be temporally deleted on that date, which will then terminate her current episode. In addition, all
versions of all future episodes will be temporally deleted. (Although we are postponing discussion
of how future episodes are managed, recall that, as we explained earlier, original transaction
authors will think of future episodes as future transactions that are queued up in a virtual
"transaction batch file," with the promise that they will be applied on their effectivity date. From this
point of view, temporally deleting all future episodes appears to be flushing those not-yet-applied
transactions from that virtual batch file.)

Because this new version of the client "shrinks" the client's effectivity time period, TRI checks are
required. In our example, there is a policy that is TRI dependent on that client, and that policy's
effective end date is 12/31/2007. Since 5/1/2007 is earlier than 12/31/2007, this violates the TRI
constraint that the effectivity period of a child version be wholly contained in the effectivity period
of the corresponding parent version. From the point of view of the real world, it means that if we let
the update proceed, the result would be a database which showed a policy in effect for six months
after its owning client was deleted. This situation, a violation of TRI, is shown in Figure 3. The
period of time during which TRI is violated is indicated by a red block.

3 of 5 10/7/2008 1:17 PM
Time and Time Again - Managing Time in Relational Databases, Part 24: ... http://www.dmreview.com/dmdirect/2008_68/10001130-1.html?type=pri...

Figure 2: A TRI Violation

This is an example of a change to the effectivity period of a parent object. If that effectivity period
"shrinks," TRI checking is triggered on all its dependents to see that they are not "temporally
stranded" by the shrinkage. There is obviously a correlative process when an update changes the
effectivity period of a child object. In that case, if the effectivity period "expands," TRI checking is
triggered on all its parents to see that they do not "temporally strand" themselves by extending
their effectivity range beyond that of one of the objects they are dependent on.

On original updates to versioned objects, then, TRI checks must be done whenever there are TRI
dependencies, and effectivity periods are changed. Otherwise, original updates to versioned
objects do not trigger TRI checks, and thus do not incur any TRI performance penalty.

There is one more situation in which a temporal update may trigger TRI checking. That is when a
{null} OFK in the updated object takes on a real value, or when an existing OFK value is changed.
In either case, TRI checking is required to ensure that the effectivity timespan of a version of the
newly referenced object wholly encloses the effectivity timespan of the object whose OFK was
updated.

With conventional RI, constraints exist to ensure that after any database update, there are no
foreign keys with "dangling references," with values for which there is no corresponding primary
key in the referenced table. With TRI, constraints exist to ensure that after any database update,
there are no object foreign keys with "dangling references," with values for which there is no row in
the referenced table for that object whose effectivity timespan wholly encloses the effectivity
timespan of the referencing object.

New Terms Added to the Glossary

Adjacent versions.
Closed episode.
Dangling reference.
Expanding a time period.
Flushing the virtual batch file.
Open episode.
Shrinking a time period.
Temporally stranded objects.
TRI (as an acronym for "temporal referential integrity").
TRI mandatory relationship.
TRI optional relationship.
Virtual batch (transaction) file.

Our discussion of TRI, except for a consideration of how future versions are affected by it, is nearly
done. Next time, we will discuss TRI for the temporal transactions resulting from original inserts
and original upserts. Then we will go on to extend our discussion to consider the effect of TRI

4 of 5 10/7/2008 1:17 PM
Time and Time Again - Managing Time in Relational Databases, Part 24: ... http://www.dmreview.com/dmdirect/2008_68/10001130-1.html?type=pri...

constraints on future versions of objects. As we shall see, those considerations are not
insignificant.

For more information on related topics, visit the following channels:

Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

5 of 5 10/7/2008 1:17 PM
Time and Time Again: Managing Time in Relational Databases, Part 25 - ... http://www.dmreview.com/dmdirect/2008_70/10001237-1.html?type=pri...

Time and Time Again: Managing Time in


Relational Databases, Part 25 - Original and
Temporal Inserts
Tom Johnston, Randall Weis
DM Direct, May 2, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the
authors. The URLs are MindfulData.com and InbaseInc.com.

This glossary is being constructed as a controlled vocabulary. By that we mean that any expression
involving temporal concepts which is used in the definition of a glossary entry must itself be a
glossary entry.

Integrating Data: An Open Source Approach


Any technology deployed in the information system needs to
interoperate with existing applications or databases. Learn through
real-life scenarios how Open Source data integration solves the
interoperability challenge. Free White Paper.

One reason for being this careful with definitions is that this kind of rigor must be applied to
definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order
logic. That, in turn, is what is needed to make definitions available for manipulation by software-
realized inferencing engines.

Another reason for being this careful with definitions is that this is the kind of rigor that must be
applied to any set of technical definitions before we can claim not only that we say what we mean,
but also that we know what we mean when we say it.

As this series continues, context becomes increasingly important so that the thread of the
discussion is not lost. Here is what we've done so far.

(See PDF below for Figure 1: Chart of Installments to Date in this Series.)

Beginning with Part 19, we interrupted our presentation of the physical implementation of the
versioning patterns, which we described earlier in the series, and began a discussion of temporal
integrity constraints. On completing Part 24, we had introduced and defined temporal RI and
related concepts, and discussed these constraints as they apply to delete and to update
transactions. In this and the following article, we turn to temporal integrity constraints as they
apply to insert and to upsert transactions.

The "ultimate" versioning pattern we are working toward is what we, following the computer
science community, call a "bi-temporal" pattern. In Parts 19 through 25, we have been discussing
both temporal entity integrity and temporal referential integrity constraints. But it is important to
note that these discussions have taken place in the context of a "uni-temporal" pattern. The only
time period considered has been an effective time period, and the only dates an effective begin

1 of 5 10/7/2008 1:17 PM
Time and Time Again: Managing Time in Relational Databases, Part 25 - ... http://www.dmreview.com/dmdirect/2008_70/10001237-1.html?type=pri...

and effective end date.

As we have seen, integrity constraints in a uni-temporal versioning context have been far from
simple. As we will see, integrity constraints in a bi-temporal versioning context are considerably
more complex than that.

Inserting a Versioned Object: the Original Insert Transaction

To the business, when they are issuing an insert of an object, they don't care if the object is
versioned or not, and indeed may not know. They are directing us to insert a row representing an
object which is not already represented by a row in the target table. Thus, an original insert
transaction doesn't need to specify a version, just the object being inserted, its business key,
whether or not match logic should be applied, and the date the insert is to take effect. This is
illustrated in Figure 2.

As we have pointed out before, we are using dates throughout these articles only for convenience.
The approach to versioning which we are presenting here applies no matter what the granularity of
the clock ticks that measure out time. The granularity used here would be the correct granularity
for transactions taking place in batch mode and being applied at most once per day. As soon as
online individually applied transactions are involved, however, we would need a finer granularity,
such as a full timestamp.

Figure 2: An Original Insert Transaction

Before we begin to examine an original insert transaction in detail, let's look specifically at the
"Match? (Y/N)" flag in Figure 2.

Inserting a Versioned Object: Haven't We Met Before?


Objects of interest to an enterprise frequently appear, go away, and then reappear later on. One
response to the fact of recurrence is to ignore it. When the object reappears, it is treated as a new
object, and no attempt is made to match it up with the history we have for it.

But enterprises generally prefer to recognize a recurring object rather than treat it as a new object.
For example, if the object is a customer, then if we don't attempt to determine whether or not she
used to be a customer of ours, we won't be able to greet her and make her feel special, like a
long-lost friend. We won't be able to say, "Welcome, back. Glad to have you as a customer once
again." We will also be unable to use the history of transactions we have with this customer to
more intelligently manage the reestablished relationship. We know the kinds of things she likes to
purchase from us. We may even know the kinds of advertising she is most responsive to. But
unless we can recognize her as a returning customer, we will lose the competitive advantage that
this past history can give us.

So the preferred response, if our target table is a versioned table, would be to look for a match of
the object being inserted with a past version. Assuming that the primary keys of rows in the target
table have not been back-propagated to their populating source systems, we must look for a
match using the source system business keys of the objects. If we find a match, we will use the
target table EID we found as the EID for the insert. This means, for a versioned target table, that it
will be used as the EID for the first version of a new episode of the object. With this EID firmly

2 of 5 10/7/2008 1:17 PM
Time and Time Again: Managing Time in Relational Databases, Part 25 - ... http://www.dmreview.com/dmdirect/2008_70/10001237-1.html?type=pri...

attached to all the versions of all the episodes of the object, we retain the information we can
glean from past episodes. We can say "Hi, and welcome back" to our returning customer. We will
already know what she likes, and what she doesn't like.

But notice the [Match? (Y/N)] flag on the original transaction. Why do we need it? If we keep the
business key on our versioned tables, and a business key is provided on the transaction, why
can't we automatically check for a match?

In a perfect world, with objects that are assigned unique business keys, we would not need this
match flag. But in the imperfect world of often dirty data that we IT professionals must manage,
sometimes we can't rely on the business key. Sometimes we must even allow duplicate business
keys, or keys with default values or {null}s in one or more columns, into our target tables.

For example, multiple independent source systems might provide insert, update and delete
transactions to the same target table. The same business key might be used in both systems, and
the two rows thus designated might or might not represent the same object. One way we could
find ourselves in this situation is if the two source systems were, up to now, managed rather
"loosely," as systems for quasi-independent organizations. Mergers and acquisitions often start off
with this "loosely coupled" approach because it is much quicker and much easier to bring online
than full integration. But when the customer bases of the acquiring and the acquired companies
overlap, as they often will, it is inefficient to manage the customers in common as different
customers. It is often an inefficiency visible to those customers as well, whose impressions of our
company will certainly not be enhanced by this annoying inability of ours to recognize them as one
and the same customer.

Another example would be overlapping product lines across the manufacturing plants of an
acquiring and an acquired company. Here, too, there is a loosely coupled and a tightly coupled
approach. Here, too, moving from the former to the latter will likely uncover latent business key
collisions and make them fully explicit.

These M&A scenarios are often exacerbated by the not infrequent practice of using a system-
generated sequence number as the last column of a business key. When this happens, the odds
of identical cross-system business keys representing different objects quickly escalates. For
example, if each of two systems has three business keys, each set identical except for the
sequence number portion, and both sets identical in the non-sequence number portion, and if we
assume that there are indeed exactly three customers involved, the odds of matching them
correctly across the two systems is nine to one, against!

Another way business keys can be compromised is when a source system uses a surrogate key,
one which, for example, recycles every couple of years. By the time it recycles, all reference to the
object it originally represented will have been archived out of that system. But if our target is a
versioned table with a longer online lifecycle than the source table, then when the reused
surrogate appears, it may still exist in the target table as an identifier for the previous object it
designated.

In these ways, or in any of the other ways in which dirty data can make its way into the very
business keys that identify the objects we are interested in, if we automatically search for a match
on original insert transactions, we may inadvertently create synonyms, multiple rows which
represent the same object. For this reason, original inserts must indicate whether or not match
logic should be applied to them. But because the same semantics often need to be provided even
when the target table is not versioned, requiring the match flag does not "tip our hand" and

3 of 5 10/7/2008 1:17 PM
Time and Time Again: Managing Time in Relational Databases, Part 25 - ... http://www.dmreview.com/dmdirect/2008_70/10001237-1.html?type=pri...

indicate that the target is a versioned table. The distinction between versioned and conventional
tables remains hidden from the users and IT personnel who provide the source transactions.

Unreliable Business Keys: Why Assume a No-Match?

If we always assume a no-match when dealing with unreliable business keys, the effect is to turn
every source update transaction into an insert. The result quickly becomes a situation in which
there are a large number of rows for each object that is frequently updated. In business IT, we
look for ways to keep these numbers down, but it's not our purpose to describe those heuristics
here.

But one could ask why the assumption must be that a match of unreliable business keys is a
no-match. Perhaps application-specific circumstances mean that the odds are nine out of ten that
such matching keys do in fact represent the same object. In that case, why not assume that they
do?

The reason, as all reasons are in business IT, is a negative impact on the bottom line. In many
cases, the cost of creating one mistaken homonym is higher than the cost of creating nine, or even
a hundred, synonyms. In this case, "homonym" refers to one row representing multiple objects,
and "synonym" refers to multiple rows representing the same object.

In these terms, the question, "Why assume a no-match?" can be rephrased as, "Why assume that
homonyms are more costly than synonyms?" And the answer is that in general, they are. Here's
why.

In the case of synonyms, when we discover one, we have to do what is called a "file merge." If two
Client table rows are discovered to represent the same client, for example, we must replace them
with a single row. As for source transactions, we must change the match logic to point pairs of
transactions hitherto directed to different targets, to the one new target which replaced them. As
for RI dependencies, we must take all the children RI-dependent on either of the two original rows,
change their foreign keys to point back to the one new row which replaced them. But notice that
all these changes can be done with code. There is no need for human judgment in the revised
match logic, or in the new RI dependencies.

But such is not the case when we are dealing with homonyms. When we discover a homonym, we
must do what is called a "file split." Say we split one client table row into two rows. The next time
update transactions appear that had both been applied to the original row, how do we determine
which of the two new clients to direct each one to? By the very fact that the homonym existed to
begin with, we know that source systems have thus far failed to make the distinction. So the match
logic must be updated to make a discrimination it previously did not, or could not, make. As for RI
dependencies, with a single parent client row replaced by two (or more) client rows, how are we to
know which new parent to redirect each RI child to?

Occasionally, the means to make these new discriminations will exist in the data, and so the new
logic can be fully automated. But far more frequently, those means do not exist in the data. In that
case, file splits can only be done if we can devote human resources to the one-time task of fixing
foreign keys, and to the recurring task of augmenting match logic to make discriminations that
cannot be made on the basis of the data alone.

In short, doing less discriminatory work is easy; doing more is hard. That is why homonyms cost
more than synonyms. And that, finally, is why in the absence of cost/benefit data to the contrary,

4 of 5 10/7/2008 1:17 PM
Time and Time Again: Managing Time in Relational Databases, Part 25 - ... http://www.dmreview.com/dmdirect/2008_70/10001237-1.html?type=pri...

we should treat unreliable source system business keys by permitting duplicate rows into the
target tables.

Sometimes, of course, a cost/benefit analysis in a specific situation will indicate that it is better to
create the occasional homonym than it is to create a flurry of synonyms. But the default
assumption always has to be that homonyms are to be avoided, and synonyms reluctantly
tolerated.

We have spent this much time on the business key match problem because it is so important. In
the past, the problem was often ignored, and each doubtful situation treated as an insert, as a
new object situation. The reason was always the difficulty in producing reliable matches in the
presence of match criteria data of poor quality, data which often came from outside our enterprise
and so was beyond our ability to clean up, or in the absence of common match criteria across
different source systems.

But throwing up our hands and populating source tables with an unknown and possible significant
number of duplicate rows, rows which represent the same object, is increasingly unacceptable.
Increasingly, businesses believe that the high cost of solving the problem, or at least reducing its
impact, is less than the cost of living with it. We have several times used the example of merging
customer files in the presence of unreliable source system keys. We chose this example because
the high cost of living with un-merged customer data is most apparent in customer-facing systems.

We will have more to say about a general approach to keeping the cost of duplicate-containing
tables to a minimum in a later set of articles. We think there is something of value to describe that
will apply to nearly all situations in which unreliable business keys must be matched. While this is
not a problem exclusive to versioned tables, neither is it a problem exclusive from them.

In the meantime, we will turn out attention back to inserts and upserts to temporal tables in our
next article.

For more information on related topics, visit the following channels:

Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

5 of 5 10/7/2008 1:17 PM
Time and Time Again - Managing Time in Relational Databases, Part 26: ... http://www.dmreview.com/dmdirect/2008_72/10001287-1.html?type=pri...

Time and Time Again - Managing Time in


Relational Databases, Part 26: Points in Time
and Periods of Time
Tom Johnston, Randall Weis
DM Direct, May 16, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the
authors. The URLs are MindfulData.com and InbaseInc.com.

This glossary is being constructed as a controlled vocabulary. By that we mean that any expression
involving temporal concepts which is used in the definition of a glossary entry must itself be a
glossary entry.

Join the Largest Community of MDM Experts in New York City,


October 19-21, 2008
Restructured to better suit your business needs, MDM Summit Fall 2008
offers value of experience and thought leadership. Speakers will discuss
hierarchy management, identity resolution and more. Pre-register by
August 8 for bonus savings!

One reason for being this careful with definitions is that this kind of rigor must be applied to
definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order
logic. That, in turn, is what is needed to make definitions available for manipulation by software-
realized inferencing engines.

Another reason for being this careful with definitions is that this is the kind of rigor that must be
applied to any set of technical definitions before we can claim not only that we say what we mean,
but also that we know what we mean when we say it.

As this series continues, context becomes increasingly important so that the thread of the
discussion is not lost. Figure 1 reviews what we've done so far.

Figure 1: Chart of Installments to Date in the Series (See PDF link at the end of the article.)

Last time, we began looking at original and temporal insert transactions but immediately digressed
into a discussion of the problems that can be encountered when matching transactions to target
tables. We planned to discuss both temporal entity integrity and temporal referential integrity as
they apply to insert transactions in this installment. But we discovered that those discussions
require another digression, this time into how we use pairs of dates to represent periods of time.

We also discovered that we have vacillated in this series between two methods of using pairs of
dates to represent periods of time, methods which are called the "closed-closed" and
"closed-open" approaches. Two other methods are possible: "open-closed" and "open-open." But
we won't enter into an extensive discussion of all four methods, because that has already been
covered by both Snodgrass and by Darwen and Lorentzos.1,2

1 of 5 10/7/2008 9:25 AM
Time and Time Again - Managing Time in Relational Databases, Part 26: ... http://www.dmreview.com/dmdirect/2008_72/10001287-1.html?type=pri...

There is both a practical and a theoretical side to the question of how to represent time periods
with dates. On the theoretical side, dates (or datetimes, or timestamps) are discrete points along a
continuous timeline. And as Zeno first discovered, and Leibniz and Newton later formalized, the
relationship between the continuous and the discrete is problematic, equally so for time as for
space.

Points in Time and Periods of Time: Practical Issues

Consider the following two versions, V1 and V2, of the same object. Let us suppose that there is
no gap in time between them, i.e., no gap along the "effectivity timeline." How are we to represent
this? Two of the possibilities, the two specific ones we have vacillated between, are shown in
Figures 2 and 3.

We also assume, in these examples, that the clock tick granularity is one day.

In both cases, the first effective time period starts on 2/19/06 and ends on 5/22/07, and the second
starts on 5/23/07 and ends on 10/14/09. So it might seem that the closed-closed representation is
the correct one, and that the closed-open representation is simply wrong. But that is not the case.

We could try to illustrate the wrongness of the closed-open representation by querying for the time
period that contains the date 5/23/07. On the closed-open representation, can we tell which
version's time period is the desired one? Yes, we can. We just need the following WHERE clause
in our SQL query:

WHERE EFFECTIVE-BEGIN-DATE LESS-THAN OR EQUAL TO "5/23/07"


AND "5/23/07" LESS THAN EFFECTIVE-END-DATE

With this clause, the query will correctly pick out V2.

So why might we have thought that the closed-open representation is wrong? Probably because
we had the mental image of our desired date being between the begin and end dates of the
versions. But "between" as we ordinarily understand it is not "BETWEEN" as SQL understands it.

We would not have gotten confused, of course, if we had used the closed-closed approach. In
that case, we could have written:

2 of 5 10/7/2008 9:25 AM
Time and Time Again - Managing Time in Relational Databases, Part 26: ... http://www.dmreview.com/dmdirect/2008_72/10001287-1.html?type=pri...

WHERE "5/23/07" BETWEEN EFFECTIVE-BEGIN-DATE AND EFFECTIVE-END-DATE

But because both clauses return the correct result, provided each is used with its corresponding
method of representing periods of time, both methods are equally correct.

So in our own vacillation between these two methods, it was this consideration which led us to
(sometimes) describe the closed-closed method as the one we preferred.

What, then, is the advantage of using the closed-open method? Well, look again at Figures 2 and
3. In both cases, V1 and V2 are contiguous. We know this because we have set up the example
on the assumption that there is no gap in time between them. With the closed-open
representation, however, we can instantly see that there is no such gap. But with the closed-
closed representation, we would also need the information that the clock tick granularity being
used is a granularity of one day.

The real problem with the closed-closed representation, however, is not what is or is not
immediately clear to us human beings. It is the relative complexity of the code which will often be
called upon to determine, of two consecutive versions, whether or not they are contiguous, i.e.,
whether or not there is a clock tick between them. With a closed-closed representation, that code
will also have to know what the clock tick granularity is.

So in our own vacillation between these two methods, it was this consideration which led us to
(sometimes) describe the closed-open method as the one we preferred.

Once the SQL standards groups can agree on temporal extensions to the standard, part of that
agreement will certainly be a way of representing time periods directly, without relying on the
confusing circumlocutions of various ways to use pairs of dates to represent time periods. But our
concern, in these articles, is with today's SQL, and so we must choose a date-pair method of
representation. Therefore, from this point forward, unless we discover other considerations which
would dramatically tip the scale the other way, we will use the closed-open representation of time
periods.

Points in Time and Periods of Time: Theoretical Issues

During our "closed-closed" phase, some time ago, one of our readers wrote us to say that closed-
closed was simply wrong for what we will call "theoretical" reasons. He wrote:

A time period most definitely does not end one clock tick prior to the end date. It ends
at precisely the time specified as the end date….. If the period ends at 12:10:15 then
when the clock strikes 12:10:15 the period is over THEN and the next period begins
precisely THEN. At that instant…

If you don’t do it that way then you are saying that date/time fields that are used for the
beginning times are interpreted as meaning the beginning of the stated time period
and date/time fields used for ending times are interpreted a DIFFERENT way, namely
as indicating the end of the stated time period. One can’t do things that way, i.e.,
changing the definition of a data type based on the context.

This reader also has an argument that is practical in nature, namely that with the closed-open
method, date arithmetic will always involve a "minus 1" clause that would not be required on the
closed-closed approach. But as we have seen, there are practical concerns no matter how we

3 of 5 10/7/2008 9:25 AM
Time and Time Again - Managing Time in Relational Databases, Part 26: ... http://www.dmreview.com/dmdirect/2008_72/10001287-1.html?type=pri...

choose to use date-pairs to represent time periods, and these concerns are therefore not
conclusive.

Our first response to our reader's theoretical argument, briefly, is this. First, the closed-open
approach does not "chang(e) the definition of a data type." If we changed the definition of a data
type depending on which columns we applied it to, then the DBMS or programming language that
accessed those columns would not be able to handle them correctly. So we are not changing the
definition of the data type. The issue, instead, is what we take each date to mean. And as long as
we write our SQL carefully (see the WHERE clauses, above), we will get the correct results no
matter which date-pair interpretation we choose.

Our second response is that decades of computer science research have shown that we can and
should represent the flow of time as a succession of atomic clock ticks, ticks which may be of any
actual length of time, such as a day, a nanosecond, etc. But no one ever has, and no one ever
will, invent a clock so precise that there is no time at all between its ticks. Only if we had such a
clock could we use clock ticks to represent time periods in a manner that conforms to the intuitions
of this reader, namely that "If the period ends at 12:10:15 then when the clock strikes 12:10:15 the
period is over THEN and the next period begins precisely THEN. At that instant."

This interpretation creates an ambiguity in the use of date-pairs to delimit time periods that can
never be resolved. For on this interpretation, two adjacent versions overlap at exactly that tick of
the clock. On that tick of the clock, this interpretation requires us to say that both V1 and V2 are in
effect.

Without going into any detail about the theoretical underpinnings of these arguments, we can
make the following points.

First, at the Planck scale, both space and time are, as far as we can measure them, discrete, with
a unit of Planck time (5.3906 x 10 to the minus 44 seconds) being the time it takes for a photon of
light to travel the Planck distance (1.6160 x 10 to the minus 35 meters). But this is too theoretically
remote to be relevant to a discussion of time as far as IT is concerned. However, it does cast
considerable doubt, obviously, on the hope that we will ever have a clock whose ticks do not take
up a finite period of time.

Second, with quantum physics considerations aside, both space and time are continuous, but
neither points (in space) nor instants (points in time) are. The ensuing conceptual difficulties were
first noted by Zeno, which he described in his paradox of the hare and the tortoise.

The solution to the correct representation of something continuous by means of something


discrete was ultimately solved, for space at least, by Leibniz and Newton. And their concept of a
limit applies equally well to time. But this is still too remote for its relevance to managing time in
relational databases to be apparent. However, for a good basic discussion of these issues, we
recommend Sowa 2000.3

Finally, though, the concepts necessary to correctly manage the relationships among time periods
represented as pairs of dates were originally formulated by James F. Allen in the mid-1980s. This
reference, and an excellent exposition of this material, is contained in Snodgrass 2000 and in
Darwen and Lorentzos 2002. Both contain references to the earliest of three articles in which Allen
explained his concepts.

With our digressions completed, we will proceed to a discussion of original and temporal inserts in

4 of 5 10/7/2008 9:25 AM
Time and Time Again - Managing Time in Relational Databases, Part 26: ... http://www.dmreview.com/dmdirect/2008_72/10001287-1.html?type=pri...

our next article, and will explain how the temporal correlates of entity integrity and referential
integrity apply to them.

References:

1. C. J. Date, Hugh Darwen, Nikos Lorentzos. Temporal Data and the Relational Model.
Morgan-Kaufmann, 2002, 90-94.
2. Snodgrass, R. T. Developing Time-Oriented Database Applications in SQL. Morgan-
Kaufmann, 2000.
3. Sowa, John F. Knowledge Representation. (Brooks-Cole, Pacific Grove CA, 2000, 103-124.

For more information on related topics, visit the following channels:

Databases

©2008 Data Management Review and SourceMedia, Inc. All rights reserved. SourceMedia is an Investcorp company.
Use, duplication, or sale of this service, or data contained herein, is strictly prohibited.

5 of 5 10/7/2008 9:25 AM