Data Cleansing with SSIS

Data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table or database. For an example correcting misspelled records from a given record set.

Below diagram shows when data cleansing is required.

To avoid those situations and to have consistent, relevant and accurate data, data cleansing is required.

Stages of the data cleansing process is as shown below.

As shown in the above figure, before data cleansing, Data Quality Testing need to be done.
After that data cleansing is done by parsing, data transformation, duplicate elimination and by doing some statistical analysis.
The final output of the data cleansing process will be accurate, consistent and relevant data.

SQL Server Integration Services(SSIS) is providing the facility to implement data cleansing processes.
There are some components which can be used to perform data cleansing operations provided by SSIS.
They are
  • Lookup
  • Fuzzy Lookup
  • Fuzzy Grouping

Below diagram shows how we can use those components to achieve data cleansing for a given record set.

In that example a simple scenario is considered. For a given sales record set, it checks whether the customer exists. If not it enters the customer record by considering it as a new customer.
There may be situations, that the customer name is misspelled. SSIS Fuzzy Lookup component can be used identify that kind of situations.
Also when new customer records are inserted, duplicate records may exists. To prevent inserting duplicate records, Fuzzy Grouping component can be used.

I'll discuss how Fuzzy Lookup and Fuzzy Grouping works in another post...

No comments:

Post a Comment

How the transformed data is written to an output file with column headers in U-SQL...

While working with U-SQL language, I noticed that there are few ways of writing data to an output file. Let's assume the SalesDetails....