There are many reasons for me to blog. One of the less vain ones is to recover and reuse code I wrote earlier. In my current project we have a need to test, test and retest the initial load of the data warehouse.
For this purpose I wrote this T-SQL script that:
Drops all foreign key constraints;
Truncates all the tables;
Recreates all the foreign key constraints it dropped earlier.
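The script itself is not reproduced in this excerpt, but the general shape is well known; here is a minimal sketch that generates the DROP and TRUNCATE statements from the catalog views. It assumes the CREATE CONSTRAINT statements have been scripted out beforehand, since sys.foreign_keys alone does not hold the full column lists:

```sql
-- Sketch only: drop all foreign keys, then truncate all tables.
DECLARE @sql nvarchar(max) = N'';

-- 1. Drop all foreign key constraints
SELECT @sql += N'ALTER TABLE ' + QUOTENAME(OBJECT_SCHEMA_NAME(parent_object_id))
             + N'.' + QUOTENAME(OBJECT_NAME(parent_object_id))
             + N' DROP CONSTRAINT ' + QUOTENAME(name) + N';'
FROM sys.foreign_keys;
EXEC sp_executesql @sql;

-- 2. Truncate all the tables
SET @sql = N'';
SELECT @sql += N'TRUNCATE TABLE ' + QUOTENAME(SCHEMA_NAME(schema_id))
             + N'.' + QUOTENAME(name) + N';'
FROM sys.tables;
EXEC sp_executesql @sql;

-- 3. Re-run the previously scripted CREATE statements
--    (e.g. built from sys.foreign_keys / sys.foreign_key_columns)
```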
This is the second post of a diptych on the magical windowing functions in data warehouse scenarios. With these functions you can greatly simplify the TSQL you write. Many complex queries with CTEs, temp tables and sub queries can be rewritten as simpler, more maintainable and better performing queries. In this post I’ll dive into some possibilities for reporting.
On the Internet you’ll find a lot of examples on using the running total technique to calculate year to date values. In this example I need the monthly sales and the YTD sales for every store.
SELECT CalendarMonth
     , StoreName
     , PeriodSales
     , SalesYTD = SUM(PeriodSales) OVER (PARTITION BY StoreName, CalendarYear
                                         ORDER BY CalendarMonth)
FROM
( SELECT CalendarYear
       , CalendarMonth
       , StoreName
       , PeriodSales = SUM(sal.SalesAmount)
  FROM FactSales sal
  JOIN DimDate dat ON sal.DateKey = dat.Datekey
  JOIN DimStore sto ON sal.StoreKey = sto.StoreKey
  GROUP BY CalendarYear, CalendarMonth, StoreName
) SalesByMonth
ORDER BY StoreName, CalendarMonth
The sub query “SalesByMonth” aggregates the sales amount for every store per month. The windowing function SUM() OVER() calculates the YTD sales, which results in the required dataset:
The SUM(SUM()) OVER() Construction
Since you can use windowing functions over an aggregate, we don’t need the sub query and can simplify this query to:
SELECT CalendarMonth
     , StoreName
     , PeriodSales = SUM(SalesAmount)
     , SalesYTD = SUM(SUM(SalesAmount)) OVER (PARTITION BY StoreName, CalendarYear
                                              ORDER BY CalendarMonth)
FROM FactSales sal
JOIN DimDate dat ON sal.DateKey = dat.Datekey
JOIN DimStore sto ON sal.StoreKey = sto.StoreKey
GROUP BY CalendarYear, CalendarMonth, StoreName
ORDER BY StoreName, CalendarMonth
The inner SUM in “SUM(SUM()) OVER()” is used in conjunction with the GROUP BY clause to calculate the monthly sales first. The outer SUM is then used in conjunction with the OVER clause to calculate the YTD sales.
Comparing to previous year
Adding the figures of the previous year as a comparison is a common reporting requirement. You can easily realize this by using the LAG function returning the results 12 months back in time. Building upon our earlier query:
SELECT *
     , PeriodSalesPrevYear = LAG(PeriodSales, 12, 0) OVER (PARTITION BY StoreName
                                                           ORDER BY CalendarMonth)
     , YTDSalesPrevYear = LAG(SalesYTD, 12, 0) OVER (PARTITION BY StoreName
                                                     ORDER BY CalendarMonth)
FROM
( SELECT CalendarMonth
       , StoreName
       , PeriodSales = SUM(SalesAmount)
       , SalesYTD = SUM(SUM(SalesAmount)) OVER (PARTITION BY StoreName, CalendarYear
                                                ORDER BY CalendarMonth)
  FROM FactSales sal
  JOIN DimDate dat ON sal.DateKey = dat.Datekey
  JOIN DimStore sto ON sal.StoreKey = sto.StoreKey
  GROUP BY CalendarYear, CalendarMonth, StoreName
) Base
ORDER BY StoreName, CalendarMonth
Which results in:
How Do We Compare to the Other Stores?
In this example I use the RANK() function to determine the store’s rank in the total monthly sales, and the store’s sales as a percentage of the total monthly sales:
SELECT CalendarMonth
     , StoreName
     , PeriodSales = SUM(SalesAmount)
     , StoreRank = RANK() OVER (PARTITION BY CalendarMonth
                                ORDER BY SUM(SalesAmount) DESC)
     , StoreShare = 100 * SUM(SalesAmount)
                    / SUM(SUM(SalesAmount)) OVER (PARTITION BY CalendarMonth)
FROM FactSales sal
JOIN DimDate dat ON sal.DateKey = dat.Datekey
JOIN DimStore sto ON sal.StoreKey = sto.StoreKey
GROUP BY CalendarMonth, StoreName
Compare to (Average of) Previous Periods
In a recent client engagement, a report used to audit the monthly invoice process caused a lot of trouble. The SQL query behind it was very difficult to comprehend and consisted of several sub queries. By using windowing functions our team was able to greatly simplify the query. The requirement, restated in terms of our example, is: give us the current month’s sales, the previous 3 months and the average of those previous 3. This is the resulting query:
SELECT CalendarMonth
     , StoreName
     , PeriodSales = SUM(SalesAmount)
     , SalesPrevPeriod1 = LAG(SUM(SalesAmount), 1, 0) OVER (PARTITION BY StoreName
                                                            ORDER BY CalendarMonth)
     , SalesPrevPeriod2 = LAG(SUM(SalesAmount), 2, 0) OVER (PARTITION BY StoreName
                                                            ORDER BY CalendarMonth)
     , SalesPrevPeriod3 = LAG(SUM(SalesAmount), 3, 0) OVER (PARTITION BY StoreName
                                                            ORDER BY CalendarMonth)
     , AveragePrevPeriods = AVG(SUM(SalesAmount)) OVER (PARTITION BY StoreName
                                                        ORDER BY CalendarMonth
                                                        ROWS BETWEEN 3 PRECEDING
                                                                 AND 1 PRECEDING)
FROM FactSales sal
JOIN DimDate dat ON sal.DateKey = dat.Datekey
JOIN DimStore sto ON sal.StoreKey = sto.StoreKey
GROUP BY CalendarYear, CalendarMonth, StoreName
ORDER BY StoreName, CalendarMonth
I especially like the way you can use the window frame clause to limit the average to 3 periods: ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING
Windowing functions, introduced in SQL Server 2005 and greatly enhanced in SQL Server 2012, have something magical: Within the context of one row in the result set you have access to the contents of the other rows of the result set.
With these functions you can greatly simplify the TSQL you write. Many complex queries with CTEs, temp tables and sub queries can be rewritten as simpler, more maintainable and better performing queries. In this post I’ll dive into some possibilities for dimensions.
In SQL 2008 and earlier
In SQL 2008 and earlier I will typically build a dimension table based upon the Type 2 Slowly Changing Dimension pattern, and then use a view with a self join to present the user with the historical and/or current attribute values. I will use this small example of a customer dimension:
When Marco moved to Paris on July 20th, 1988 and married Jose, a new row was added for each of them with the new attribute values, and the EndDate of the old rows was set to the date of the change. This is how attribute changes are handled for Type 2 Slowly Changing Dimensions.
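For reference, the sample dimension could be set up like this (hypothetical column types and values, matching the columns used in the views in this post):

```sql
-- Hypothetical shape of the dim.Customer sample
CREATE TABLE dim.Customer
( Id            int         NOT NULL  -- surrogate key
, Number        int         NOT NULL  -- business key
, Name          varchar(50) NOT NULL
, City          varchar(50) NOT NULL
, MaritalStatus varchar(20) NOT NULL
, StartDate     date        NOT NULL
, EndDate       date        NOT NULL );

INSERT INTO dim.Customer VALUES
  (1, 101, 'Marco', 'Rome',  'Single',  '1980-01-01', '1988-07-20')
, (2, 101, 'Marco', 'Paris', 'Married', '1988-07-20', '9999-12-31')
, (3, 102, 'Jose',  'Lyon',  'Single',  '1985-01-01', '1988-07-20')
, (4, 102, 'Jose',  'Paris', 'Married', '1988-07-20', '9999-12-31');
```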
To consume the dimension information I will typically use a view in the model schema as a source for Analysis Services / PowerPivot / reports:
CREATE VIEW model.Customer AS
SELECT his.Id
     , his.Number
     , his.Name
     , his.City
     , CurrentCity = cur.City
     , his.MaritalStatus
     , CurrentMaritalStatus = cur.MaritalStatus
FROM dim.Customer his
JOIN dim.Customer cur ON his.Number = cur.Number
WHERE cur.EndDate = '9999-12-31'
Which results in:
In SQL 2012 Using the LAST_VALUE Function
If you’re not working for the Oracle at Delphi, the last value will typically be the current value of an attribute. So in SQL 2012 this view can be replaced with:
CREATE VIEW model.Customer AS
SELECT Id
     , Number
     , Name
     , City
     , CurrentCity = LAST_VALUE(City) OVER (PARTITION BY Number ORDER BY StartDate
                       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
     , MaritalStatus
     , CurrentMaritalStatus = LAST_VALUE(MaritalStatus) OVER (PARTITION BY Number ORDER BY StartDate
                       ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM dim.Customer
Although the LAST_VALUE function seems a bit awkward to write due to the long window frame clause: “ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING” it has some nice advantages:
No need to use a self join, which will enhance performance.
No need to use the EndDate column.
Much easier to maintain, because the purpose of the LAST_VALUE function will be more obvious to your successors.
SCD Type 0 with the FIRST_VALUE Function
Occasionally you may stumble upon a request/requirement to show the original value of an attribute. (e.g. the sales that landed the customer). In that case you can simply add a column using the FIRST_VALUE function:
FirstCity = FIRST_VALUE(City) OVER(PARTITION BY Number ORDER BY StartDate)
Mapping Queries Using Windowing Functions
When you load fact tables you will want to look up the surrogate keys of the dimensions. In the simplest variant you would use (in SQL 2008):
SELECT Number, Id
FROM dim.Customer
WHERE EndDate = '9999-12-31'
In SQL 2012, assuming you will not store the EndDate in your ETL process, you can use:
SELECT Number, Id
FROM ( SELECT Number
            , Id
            , RowNumber = ROW_NUMBER() OVER (PARTITION BY Number
                                             ORDER BY StartDate DESC)
       FROM dim.Customer ) Sub
WHERE RowNumber = 1
Unfortunately you will have to use the sub query construct here because it’s not yet possible to use Windowing Functions in the WHERE clause.
But Why Would You Not Add an EndDate in the ETL Process?
If you don’t end date rows, the ETL process gets much easier, faster and less error-prone: you don’t have to distinguish between new and changed rows, you just add both in the same way to the dimension table. And you don’t have to identify and update the ‘old’ rows.
And of course, if you really need the EndDate you can just get it with the new LEAD function:
EndDate = LEAD(StartDate, 1, '9999-12-31') OVER (PARTITION BY Number ORDER BY StartDate)
Type 2 loading is a pattern that is often used to incrementally load dimensions in a data warehouse. But you can also use it for loading satellites in a data vault or loading tables in a historical/persistent staging area. It consists of three parts:
Identify new rows and add them to the target database.
Identify changed rows and add them to the target database.
End date the old rows in the target database that have been replaced by those of step 2.
MERGE
There are several ways to implement a Type 2 loading pattern. In this blog post I’ll describe one using the TSQL MERGE statement. I have come to like it, and compared to other methods it seems to be a well performing option for handling slowly changing dimensions. (See this study by Microsoft’s Matt Mason.)
In his example for processing a type 2 dimension he uses a sample that is similar to the one below:
INSERT INTO dimCustomer (AuditID, StartDate, EndDate, CustomerID, CustomerName, Adress, ..)
SELECT @AuditID, GETDATE(), NULL, CustomerID, CustomerName, Adress, ..
FROM
( MERGE DWH.dbo.dimCustomer tgt
  USING STG.dbo.Customer src
     ON src.CustomerID = tgt.CustomerID
  WHEN NOT MATCHED THEN
    INSERT VALUES (@AuditID, GETDATE(), NULL, CustomerID, CustomerName, Adress, ..)
  WHEN MATCHED AND tgt.EndDate IS NULL
   AND (tgt.CustomerName <> src.CustomerName OR tgt.Adress <> src.Adress ..) THEN
    UPDATE SET tgt.EndDate = GETDATE()
  OUTPUT $action Action_Out, src.CustomerID, CustomerName, Adress, ..
) AS MERGE_OUT
WHERE MERGE_OUT.Action_Out = 'UPDATE';
The outcome of the MERGE statement is used as the source for the INSERT INTO statement at the top (the aforementioned step 2: identify changed rows and add them to the target database). In the MERGE statement itself there are two actions: 1. Insert any new rows. 2. End date the rows in the target that have changed.
NULL <> NULL
A problem with this example is the implicit assumption that there are no NULL values stored in the columns. The comparison NULL <> NULL evaluates to UNKNOWN rather than true, so changes to or from a NULL value would not be detected, and the changed rows would never make it into the model.
There are however two SQL expressions that treat NULL values as equal: INTERSECT and EXCEPT. So we can easily circumvent this problem by changing the highlighted rows into:
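A sketch of that NULL-safe WHEN MATCHED condition, using the same sample columns (the full column list is elided as in the example above):

```sql
WHEN MATCHED AND tgt.EndDate IS NULL
 AND EXISTS ( SELECT src.CustomerName, src.Adress  -- ..
              EXCEPT
              SELECT tgt.CustomerName, tgt.Adress )
THEN UPDATE SET tgt.EndDate = GETDATE()
```

Because EXCEPT treats two NULLs as equal, the EXISTS is only true when at least one column has really changed, NULLs included.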