Wednesday, May 28, 2014

Importance of the Surrogate Key in Data Warehouse Design

In data warehousing, a surrogate key is a unique identifier in a dimension table that is independent of the source system.
Normally it is an auto-incrementing integer value.
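
As a minimal sketch of what this looks like, the snippet below creates an employee dimension in an in-memory SQLite database using Python's built-in sqlite3 module. The table name dim_employee, the column names, and the sample rows are assumptions made for illustration only.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")

# Hypothetical employee dimension: employee_key is the surrogate key,
# employee_code is the key that comes from the source system.
conn.execute("""
    CREATE TABLE dim_employee (
        employee_key  INTEGER PRIMARY KEY AUTOINCREMENT,
        employee_code TEXT NOT NULL,
        employee_name TEXT NOT NULL
    )
""")

# Only source attributes are supplied; the database assigns the surrogate key.
conn.executemany(
    "INSERT INTO dim_employee (employee_code, employee_name) VALUES (?, ?)",
    [("1001", "Alice"), ("1002", "Bob")],
)

dim_rows = conn.execute(
    "SELECT employee_key, employee_code FROM dim_employee ORDER BY employee_key"
).fetchall()
print(dim_rows)  # [(1, '1001'), (2, '1002')]
```

The surrogate key values carry no business meaning; they only identify the row inside the warehouse.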

The purpose of having a surrogate key in a dimension table is to make the relationship between dimension tables and fact tables independent of the source system.
The following example shows why the surrogate key is important.

Let's say a company maintains the employee code as a fixed-length number, and assume the length is 4 digits.
After a few years, the company has grown and the number of employees has increased.
Because of that, the 4-digit employee code is no longer sufficient, and the company decides to extend the employee code to 6 digits.
In such a scenario, the employee code of every employee changes.

In the data warehouse, if we use the employee code as the key without any surrogate key, each employee is linked to the facts through the employee code.
Therefore, if the employee code changes, the employee dimension and all the fact rows linked to it need to be updated.
This is a very costly operation, especially when the fact tables are large, and we need to plan for this kind of situation while designing the data warehouse.
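
The cost described above can be sketched with a small in-memory SQLite example in Python. The table names, the sample data, and the assumption that the new 6-digit code is the old code padded with leading zeros are all hypothetical; the point is only to count how many rows must be touched when the employee code itself is the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical design WITHOUT a surrogate key: facts reference employee_code directly.
conn.executescript("""
    CREATE TABLE dim_employee (employee_code TEXT PRIMARY KEY, employee_name TEXT);
    CREATE TABLE fact_sales (employee_code TEXT, amount REAL);
""")
conn.executemany("INSERT INTO dim_employee VALUES (?, ?)",
                 [("1001", "Alice"), ("1002", "Bob")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [("1001", 10.0), ("1001", 20.0), ("1002", 5.0)])

def widen(code: str) -> str:
    """Assumed mapping from a 4-digit code to a 6-digit code (zero padding)."""
    return code.zfill(6)

dim_updated = 0
fact_updated = 0
old_codes = [row[0] for row in conn.execute("SELECT employee_code FROM dim_employee").fetchall()]
for old in old_codes:
    new = widen(old)
    # The dimension row AND every fact row carrying the old code must change.
    dim_updated += conn.execute(
        "UPDATE dim_employee SET employee_code = ? WHERE employee_code = ?", (new, old)
    ).rowcount
    fact_updated += conn.execute(
        "UPDATE fact_sales SET employee_code = ? WHERE employee_code = ?", (new, old)
    ).rowcount

print(dim_updated, fact_updated)  # 2 3
```

Here only three fact rows are rewritten, but in a real warehouse the fact table update grows with the number of fact rows, not with the number of employees.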

To avoid that, we can use a surrogate key. Let's look at how the surrogate key avoids this kind of situation.
Since the primary key of the employee dimension is the surrogate key, all the facts linked to the employee dimension are linked through that surrogate key.
Because of that, even when the employee code changes, the only thing you have to do is update the employee code in the employee dimension. That is all; there is no need to update any fact table.
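
The same change can be sketched with a surrogate key in place, again using Python's sqlite3 module with an in-memory database. The table and column names, the sample data, and the zero-padding rule for the new 6-digit code are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical design WITH a surrogate key: facts reference employee_key.
conn.executescript("""
    CREATE TABLE dim_employee (
        employee_key  INTEGER PRIMARY KEY AUTOINCREMENT,
        employee_code TEXT NOT NULL,
        employee_name TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        employee_key INTEGER NOT NULL REFERENCES dim_employee(employee_key),
        amount       REAL NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO dim_employee (employee_code, employee_name) VALUES (?, ?)",
    [("1001", "Alice"), ("1002", "Bob")],
)
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 10.0), (1, 20.0), (2, 5.0)])

facts_before = conn.execute("SELECT * FROM fact_sales ORDER BY rowid").fetchall()

# The employee code changes from 4 to 6 digits: only the dimension is updated.
dim_updated = 0
dim_rows = conn.execute("SELECT employee_key, employee_code FROM dim_employee").fetchall()
for key, code in dim_rows:
    dim_updated += conn.execute(
        "UPDATE dim_employee SET employee_code = ? WHERE employee_key = ?",
        (code.zfill(6), key),  # assumed zero-padding rule
    ).rowcount

facts_after = conn.execute("SELECT * FROM fact_sales ORDER BY rowid").fetchall()

# Facts are untouched and still join to the dimension, now showing the new codes.
totals = conn.execute("""
    SELECT d.employee_code, SUM(f.amount)
    FROM fact_sales f JOIN dim_employee d ON d.employee_key = f.employee_key
    GROUP BY d.employee_key ORDER BY d.employee_key
""").fetchall()
print(dim_updated, facts_before == facts_after, totals)
# 2 True [('001001', 30.0), ('001002', 5.0)]
```

The update touches one row per employee, regardless of how many fact rows exist.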

Therefore, as a best practice, we use a surrogate key as the primary key of each dimension table.