Type 2 Loading is a pattern that is often used to incrementally load dimensions in a data warehouse. But you can also use it for loading satellites in a data vault or loading tables in a historical/persistent staging area. It consist of three parts:
- Identify new rows and add them to the target database.
- Identify changed rows and add them to the target database.
- End date the old rows in the target database that have been replaced with those of step 2.
There are several ways to implement a Type 2 loading pattern. In this blog post I’ll describe using the TSQL statement: MERGE. I have come to like it and compared to other methods it seems to be a good performing option for handling slowly changing dimension type solutions. (See this study from Microsoft’s Matt Mason)
Warren Thornthwaite from the Kimball Group did a good explanation on using the SQL MERGE Statement for Slowly Changing Dimension in Design Tip #107 Using the SQL MERGE Statement for Slowly Changing Dimension Processing.
In his example for processing type 2 dimension he uses a sample that is similar to the one below:
INSERT INTO dimCustomer (AuditID,StartDate,EndDate, CustomerID, CustomerName, Adress..)
SELECT @AuditID, GETDATE(),NULL,CustomerID, CustomerName, Adress,..
( MERGE DWH.dbo.dimCustomer tgt
USING STG.dbo.Customer src
ON src.CustomerID = tgt.CustomerID
WHEN NOT MATCHED THEN
INSERT VALUES (@AuditID, GETDATE(), NULL, CustomerID, CustomerName, Adress ..)
AND tgt.__EndDate IS NULL
AND (tgt.CustomerName <> src.CustomerName
AND (tgt.Adress <> src.Adress)
UPDATE SET tgt.EndDate = GETDATE()
OUTPUT $Action Action_Out, src.CustomerID, CustomerName, Adress ..
) AS MERGE_OUT
WHERE MERGE_OUT.Action_Out = ‘UPDATE’;
The outcome of the MERGE statement is used as source for the INSERT INTO statement at the top.
(Afore mentioned step 2: Identify changed rows and add them to the target database)
In the Merge statement there are two actions:
1. Insert any new rows
3. End date the rows in the source that have changed.
NULL <> NULL
A problem with this example is the implicit assumption that there are no NULL values stored in the columns. Because otherwise the highlighted expressions with nulls would yield to true (NULL<>NULL) and rows would unnecessary be added to the model.
There are however two SQL expressions that consider NULL Values as equal: INTERSECT and EXCEPT. So we can easily circumvent this problem by changing the highlighted rows into:
AND EXISTS (SELECT tgt.CustomerName, tgt.Adress …
EXCEPT SELECT src.CustomerName, src.Adress …)