Surrogate keys are a widely used and accepted design standard in data warehouses. A surrogate key (SK) is a sequentially generated unique number attached to each and every record in a Dimension table in a Data Warehouse. It joins the fact and dimension tables and is necessary for handling changes in dimension table attributes.
It is UNIQUE since it is a sequentially generated integer, distinct for each record inserted into the table.
It is MEANINGLESS since it carries no business meaning about the record it is attached to.
It is SEQUENTIAL since it is assigned in sequential order as new records are created in the table, starting at one and going up to the highest number needed.
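These three properties can be shown in a minimal sketch (the record layout and the `customer_sk` column name are hypothetical examples, not from any particular DW):

```python
from itertools import count

def assign_surrogate_keys(records, start=1):
    """Attach a sequential, meaningless integer SK to each dimension record.
    The key says nothing about the business data it is attached to."""
    seq = count(start)  # starts at `start`, increments by 1 per new row
    return [{**rec, "customer_sk": next(seq)} for rec in records]

rows = [{"customer_id": "C-1001"}, {"customer_id": "C-1002"}]
dim = assign_surrogate_keys(rows)
# each row now carries a unique, sequential customer_sk: 1, 2, ...
```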
The diagram below shows how the FACT table is loaded from the source.
The image below shows a typical Star Schema, with different Dimensions joined to the Fact table using SKs.
Ralph Kimball emphasizes the abstraction of the NK. According to him, surrogate keys should NOT be:
- Smart, where you can tell something about the record just by looking at the key.
- Composed of natural keys glued together.
- Implemented as multiple parallel joins between the dimension table and the fact table; so-called double- or triple-barreled joins.
As per Thomas Kejser, a good key is a column that has the following properties:
- It is forced to be unique
- It is small
- It is an integer
- Once assigned to a row, it never changes
- Even if the row is deleted, the key is never re-used to refer to a new row
- It is a single column
- It is stupid
- It is not intended to be remembered by users
If the above-mentioned properties are taken into account, the SK is a great candidate for a Good Key in a DW.
Apart from these, a few more reasons for choosing the SK approach are:
If we replace the NK with a single integer, we can save a substantial amount of storage space. The SKs of different Dimensions are stored as Foreign Keys (FK) in the Fact tables to maintain Referential Integrity (RI); storing concise SKs instead of big, unwieldy NKs requires far less space. The UNIQUE index built on the SK also takes less space than a UNIQUE index built on an NK, which may be alphanumeric.
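As a back-of-envelope illustration of the storage argument (the row count and key widths below are assumptions for the sake of arithmetic, not figures from the text):

```python
# Rough storage saving from replacing an NK with an integer SK in one FK column.
# All numbers below are illustrative assumptions.
fact_rows = 100_000_000   # rows in the fact table
nk_bytes  = 20            # alphanumeric natural key, e.g. CHAR(20)
sk_bytes  = 4             # four-byte integer surrogate key

nk_storage_gb = fact_rows * nk_bytes / 1024**3   # ~1.86 GB
sk_storage_gb = fact_rows * sk_bytes / 1024**3   # ~0.37 GB
saving_gb     = nk_storage_gb - sk_storage_gb    # ~1.49 GB for this column alone
```

The same saving repeats for every dimension FK in the fact table, and again in every index that covers the key.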
Replacing big, ugly NKs and composite keys with tight integer SKs is bound to improve join performance, since joining two integer columns is faster. So it provides an extra edge in ETL performance by speeding up data retrieval and lookups.
An advantage of a four-byte integer key is that it can represent more than 2 billion different values, which is enough for any dimension; the SK will not run out of values, not even for a big or monster Dimension.
The SK is independent of the data contained in the record; we cannot infer anything about the data in a record by looking at the SK alone. Hence it provides Data Abstraction.
So, apart from abstracting the critical business data involved in the NK, implementing SKs in our DW gives us the advantage of reduced storage space as well. It has become standard practice to associate an SK with a table in a DW, irrespective of whether it is a Dimension, Fact, Bridge or Aggregate table.
The values of SKs have no relationship with the real-world meaning of the data held in a row. Therefore, overuse of SKs leads to the problem of disassociation.
The generation and attachment of SKs creates an extra ETL burden. Sometimes the actual piece of code is short and simple, but generating the SK and carrying it forward to the target adds overhead.
During Horizontal Data Integration (DI), where multiple source systems load data into a single Dimension, we have to maintain a single SK Generating Area to enforce the uniqueness of the SK. This can be an extra overhead on the ETL.
Even query optimization becomes difficult: since the SK takes the place of the PK, the unique index is applied to that column, and any query based on the NK leads to a Full Table Scan (FTS), as it cannot take advantage of the unique index on the SK.
Replication of data from one environment to another, i.e. Data Migration, becomes difficult. Since SKs from different Dimension tables are used as FKs in the Fact table, and SKs are DW-specific, any mismatch in the SK for a particular Dimension results in missing or erroneous data when we join the tables in a Star Schema.
If duplicate records come from the source, there is a risk of duplicates being loaded into the target, since the Unique Constraint is defined on the SK and not on the NK.
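One common mitigation, sketched below with hypothetical field names, is to de-duplicate on the NK before SKs are assigned; the SK's own unique constraint would otherwise happily accept both copies:

```python
def dedupe_on_natural_key(source_rows, nk_field="customer_id"):
    """Drop source duplicates by NK before SK assignment. Because the
    unique constraint sits on the SK, duplicate NKs would each receive
    a distinct SK and both land in the target without this check."""
    seen, clean = set(), []
    for row in source_rows:
        if row[nk_field] not in seen:
            seen.add(row[nk_field])
            clean.append(row)
    return clean

src = [{"customer_id": "C-1001"},
       {"customer_id": "C-1001"},   # duplicate from the source
       {"customer_id": "C-1002"}]
loaded = dedupe_on_natural_key(src)
# only two rows survive for SK assignment
```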
The crux of the matter is that an SK should not be implemented just in the name of standardizing your code. An SK is required when we cannot use an NK to uniquely identify a record, or when the NK is not a good fit for the PK.
Generation Approaches Using Informatica
A Surrogate Key is a sequentially generated unique number attached to each and every record in a Dimension table in a Data Warehouse. We discussed Surrogate Keys in detail in our previous article; here we will concentrate on different approaches to generating a Surrogate Key for different types of ETL processes.
The NEXTVAL port from the Sequence Generator can be mapped to the surrogate key column in the target table. Shown below is the Sequence Generator transformation.
Note: Make sure to create a reusable transformation, so that the same transformation can be reused in multiple mappings that load the same dimension table.
The schema name (DW) and sequence name (Customer_SK) can be passed in as input values for the transformation, and the output can be mapped to the target SK column. Shown below is the SQL transformation.
Create a database function as below. Here we are creating an Oracle function.

CREATE OR REPLACE FUNCTION DW.Customer_SK_Func
RETURN NUMBER
IS
  Out_SK NUMBER;
BEGIN
  SELECT DW.Customer_SK.NEXTVAL INTO Out_SK FROM DUAL;
  RETURN Out_SK;
EXCEPTION
  WHEN OTHERS THEN
    raise_application_error(-20001, 'An error was encountered - '||SQLCODE||' -ERROR- '||SQLERRM);
END;

You can import the database function as a Stored Procedure transformation, as shown in the image below.
Now, just before the target instance for the Insert flow, we add an Expression transformation. We add an output port there with the following formula. This output port, GET_SK, can be connected to the target surrogate key column.
Note: The database function can be parameterized, and the stored procedure can also be made reusable, to make this approach more effective.
For a Dynamic Lookup on the target, we have the option of associating any lookup port with an input port, output port, or Sequence-ID. When we associate a Sequence-ID, the Integration Service generates a unique integer value for each row inserted into the lookup cache. This is applicable only to ports with the Bigint, Integer or Small Integer data type; since an SK is usually of integer type, we can exploit this feature.
The Integration Service uses the following process to generate Sequence IDs.
When the Integration Service creates the dynamic lookup cache, it tracks the range of values for each port that has a sequence ID in the dynamic lookup cache.
When the Integration Service inserts a row of data into the cache, it generates a key for a port by incrementing the greatest sequence ID value by one.
When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at one. The Integration Service increments each sequence ID by one until it reaches the smallest existing value minus one. If the Integration Service runs out of unique sequence ID numbers, the session fails.
Shown above is the dynamic lookup configuration to generate the SK for CUST_SK.
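The Sequence-ID behaviour described above can be sketched as a toy model (a deliberate simplification; the real cache also tracks value ranges per port and respects the port's data type):

```python
class DynamicLookupCache:
    """Toy model of how a dynamic lookup cache hands out Sequence-IDs:
    each new row gets greatest-ID-plus-one; at the integer ceiling the
    service wraps around and reuses gaps; no gap left means failure."""
    MAX_ID = 2_147_483_647  # 32-bit signed integer ceiling

    def __init__(self):
        self.cache = {}  # natural key -> sequence ID

    def insert(self, natural_key):
        if natural_key in self.cache:        # existing row: reuse its SK
            return self.cache[natural_key]
        next_id = max(self.cache.values(), default=0) + 1
        if next_id > self.MAX_ID:            # wrap around, hunt for a gap;
            used = set(self.cache.values())  # StopIteration here mirrors
            next_id = next(i for i in range(1, self.MAX_ID + 1)  # session
                           if i not in used)                     # failure
        self.cache[natural_key] = next_id
        return next_id

cache = DynamicLookupCache()
cache.insert("C-1001")   # new row  -> 1
cache.insert("C-1002")   # new row  -> 2
cache.insert("C-1001")   # existing -> 1 again
```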
The Integration Service generates a Sequence-ID for each row it inserts into the cache. For any record already present in the target, it gets the SK value from the dynamic lookup cache, based on the matching of the associated ports. So, if we take this port and connect it to the target SK field, there is no need to generate SK values separately, since the new SK value (for records to be inserted) or the existing SK value (for records to be updated) is supplied by the Dynamic Lookup.
The disadvantage of this technique is that we don't have any separate SK Generating Area; the source of the SK is embedded entirely in the code.
Next in the mapping, after the SQ, use an Expression transformation. Here we will generate the SKs for the Dimension based on the previous value generated. We create the following ports in the EXP to compute the SK value, with the lookup condition CUSTOMER_ID = IN_CUSTOMER_ID:

VAR_INC = VAR_COUNTER
OUT_COUNTER = VAR_COUNTER

When the mapping starts, for the first row we look up the Dimension table to fetch the maximum available SK in the table. Then we keep incrementing the SK value stored in the variable port by 1 for each incoming row. Here OUT_COUNTER gives the SKs to be populated in CUSTOMER_KEY.
Suppose we have a session, s_New_Customer, which loads the Customer Dimension table. Before that session in the workflow, we add a dummy session, s_Dummy.
In s_Dummy, we will have a mapping variable, e.g. $$MAX_CUST_SK, which is set to the value of MAX(SK) in the Customer Dimension table.
We will have CUSTOMER_DIM as our source table, and the target can be a simple flat file which will not be used anywhere. We pull this MAX(SK) from the SQ, and then in an EXP we assign this value to the mapping variable using the SETVARIABLE function. So we will have the following ports in the EXP:
This output port is connected to the flat file port, but the value assigned to the variable persists in the repository.
In our second mapping we start generating the SK from the value $$MAX_CUST_SK + 1. But how can we pass the variable value from one session to the other?
Here the Workflow Variable comes into the picture. We define a WF variable, $$MAX_SK, and in the Post-session on success variable assignment section of s_Dummy we assign the value of $$MAX_CUST_SK to $$MAX_SK. Now the variable $$MAX_SK contains the maximum available SK value from the CUSTOMER_DIM table. Next, we define another mapping variable in the session s_New_Customer, $$START_VALUE, and this is assigned the value of $$MAX_SK in the Pre-session variable assignment section of s_New_Customer.
$$MAX_SK = $$MAX_CUST_SK
$$START_VALUE = $$MAX_SK
Now, in the actual mapping, we add an EXP with the following ports to compute the SKs one by one for each record being loaded into the target:
VAR_COUNTER = IIF (ISNULL (VAR_INC), $$START_VALUE + 1, VAR_INC + 1)
VAR_INC = VAR_COUNTER
OUT_COUNTER = VAR_COUNTER
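The variable-port logic can be simulated as follows (assuming, purely for illustration, that s_Dummy left $$MAX_CUST_SK = 500, so $$START_VALUE = 500; the row data is hypothetical):

```python
def run_expression(rows, start_value):
    """Simulate the EXP ports: VAR_COUNTER seeds from $$START_VALUE + 1
    on the first row (VAR_INC is NULL), then increments per row."""
    var_inc = None  # NULL before the first row, like an unset variable port
    out = []
    for row in rows:
        # VAR_COUNTER = IIF(ISNULL(VAR_INC), $$START_VALUE + 1, VAR_INC + 1)
        var_counter = start_value + 1 if var_inc is None else var_inc + 1
        var_inc = var_counter            # VAR_INC = VAR_COUNTER
        out.append({**row, "CUSTOMER_KEY": var_counter})  # OUT_COUNTER
    return out

new_rows = run_expression(
    [{"CUSTOMER_ID": "C-2001"}, {"CUSTOMER_ID": "C-2002"}],
    start_value=500,
)
# CUSTOMER_KEY values continue from the dimension's previous maximum: 501, 502
```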