Four Approaches to Data Warehousing

The requirements of a Data Warehouse solution continually change during its life time. For some organizations you’ll typically expect more changes than others. Think of maturity, history of organizational changes: mergers, acquisitions etc.

You should adopt a solution type that will yield the best value taking these things into account.

Data Warehouse Solutions

When confronted with a new data warehouse solution that I am going to design or redesign I will typically choose between one of the following four approaches:

  1. One-Shot Data Warehouse
  2. Keeping History in Dimensions
  3. Historical / Persistent Staging Area
  4. Data Vault

In this blog post I’ll briefly dive into these approaches and discuss some of the implementation issues.

The choice for a specific approach depends on a number of circumstances, notably:

  • The expected total size of the solution
  • The number of source systems and how the information in them overlap
  • The average expected life-time of source systems: source system volatility
  • Current user requirements
  • Expected number of changes in user requirements: requirements volatility
1. One-Shot Data Warehouse (small- and mid-sized solution)
gun

Current computing possibilities will typically enable the complete build (from scratch) of a data warehouse solution for small- and mid-sized solutions within the typically nightly maintenance window. This can be a very efficient solution with a high return on investment. Some people call it the destroy and rebuild approach since you are removing all previous data from the data warehouse before rebuilding it.

An obvious disadvantage of this approach is that there is no track of changes in the source system: If a customer moves from London to Paris all the historical sales will be reported as Paris sales. Another disadvantage is the sheer amount of data that is transferred every night. That can accumulate to an amount that can not been loaded in the nightly maintenance window.

2. Keeping History in Dimensions

The problem of keeping track of history has been a major issue in data warehousing. In the theories  by Ralph Kimball Slowly Changing Dimensions play an import role. In his books he mentioned some patterns to handle history in dimensions.

Type 1: Change history

For some attributes the business isn’t interested in keeping historical information. This of course is most appropriate when correcting previous errors. Or in other situations where there is no business value in keeping track of historical information. For example when a person’s name changes due to marriage. The rows that were imported earlier in the dimension table are corrected by an UPDATE statement.

Type 2: Keep history

In this pattern history is preserved because a new record is entered for every change in a dimension attribute. The old row is marked as inactive and an end date is added to the row. Any new fact tables records that are inserted after this dimension row is changed will link to the newly added row.

Other types

Type 1 and Type 2 are the most commonly used. However there are some alternatives. They are described in this Wikipedia article.

Other advantages

Besides the tracking of history another major advantage is that you’ll only need changed and new rows (the delta rows) from your source system. And thus diminishing the time you need to load the data in your data warehouse. However getting only these delta rows can be challenging.

3. Historical / Persistent Staging Area

In this approach you’ll typically get a copy of all relevant source tables and add temporal information in a ‘StartDate’ and a ‘EndDate’ column. Again you’ll only process the delta rows: loading new and changed rows. And whenever a row has changed you’ll end date the old row.

Based upon this historical staging area you can adopt method 1 or 2 to load the data in your dimensional model.

What are the reasons for having a persistent staging area?

In an earlier blog post I gave this answer:

Well …first of all it could be a demand from auditors or data governance initiatives. Possible driven by external regulations.(Sarbanes-Oxley, Basel I, Basel II, HIPAA, ..)

A second reason which can be tied to the first has to do with keeping control of / reporting on data quality issues in the source systems and thus:

  • Identify possible improvements in processes. (e.g. same data entered twice)
  • Increase / decrease confidence in the data, information and decisions

The third reason has to do with agility. The ability to respond to changes rapidly. Small changes like changing a dimension attribute from type 1 to 2 or adding additional attributes to a dimension. But also large changes like big organizational changes, mergers, new ERP implementations. By having a persistent staging area  it’ll be far easier to respond to these changes and rebuild the data marts (with history). Resulting in quicker, better reporting after such a change.

4. Data Vault

The Data Vault is a special implementation of a persistent staging area. Source tables are split based on column aspects separating the structural items (business keys and the associations between the business keys) from the descriptive attributes.

And thus the same advantages apply to the Data Vault as well. The power of the Data Vault comes from the integration on business keys. This approach is therefore especially suitable if you have source systems with overlapping information and/or if source systems change frequently over time.

This wiki article by Ronald Kunenberg gives a nice overview of the Data Vault Methodology with a lot of additional references.

Conclusion

In this article I discussed some aspects that you should consider when you choose an approach for (re)designing your Data Warehouse. This table summarizes these aspects:

Aspect /  Approach:1234
Simple, fast solutionx   
Only process delta rows xxx
Keep track of historical information xxx
Solve auditing / data governance issues  xx
Control over data quality in source systems  xx
Easily adapt to changing user requirements  xx
Source systems with overlapping information   x
Frequently changing source systems   x

Dimensional modeling and Data Vault – a happy marriage?

A lot of emphasis on the Internet lately has been placed on the difference between both methods. This blogpost from Dan Linstedt and this discussion on the Kimball forums are examples of this strange focus.

A focus I don’t totally understand since both techniques serve different purposes and can happily live together in most data warehouse solutions:

  • Data Vault is all about the back-room, efficiently collecting, integrating and preserving data from the source systems.
  • Dimensional modeling is all about the front-room, publishing the organization’s data assets to effectively support decision making.

From a Dimensional modeling (Kimball) perspective Data Vault is an implementation of a persistent staging area. Problems that some people have with this view/name seem to boil down to the definition of what a data warehouse is. I personally don’t have a problem with this name nor do I consider it to be less important. And yes it will qualify as a data warehouse.

Furthermore Dimensional modeling is generally acknowledged to be the best way to publish data to the business users and has a wide-spread adoption. A focus for further acceptance of Data Vault could be that it is the best implementation of a persistent staging area.

So let’s try to make this case … but first:

What are the reasons for having a persistent staging area?

Well …first of all it could be a demand from auditors or data governance initiatives. Possible driven by external regulations.(Sarbanes-OxleyBasel IBasel IIHIPAA, ..)

A second reason which can be tied to the first has to do with keeping control of / reporting on data quality issues in the source systems and thus:

  • identify possible improvements in processes. (e.g. same data entered twice)
  • increase / decrease confidence in the data, information and decisions

The third reason has to do with agility. The ability to respond to changes rapidly. Small changes like changing a dimension attribute from type 1 to 2 or adding additional attributes to a dimension. But also large changes like big organizational changes, mergers, new ERP implementations. By having a persistent staging area (preferably integrated on business keys) it’ll be far easier to respond to these changes and rebuild the data marts (with history). Resulting in quicker, better reporting after such a change.

Of course you need to balance these advantages against the extra costs. Before we take a look at these extra costs and other raised disadvantages let’s examine some …

Similarities between Data Vault and Dimensional modeling

Let’s first focus on the methodology of both approaches. They both advocate a bottom-up iterative implementation: Identify scope, select business area, build the solution and .. focus on a new iteration.

Both use a limited number of strict table types with a well-defined function: Dimension- and fact tables in Dimensional modeling and hub-, link- and satellite tables in Data Vault modeling. Where a hub tables and its satellites typically roll-up into a dimension and a link tables and its satellites typically roll-up into a fact table.

Another “fit” is the use of a volatile staging area as a synchronization- and recovery point.

Focus on the extra costs

Some would argue that since you are building two data warehouses you’ll end up with twice the costs. That’s seems far fetched … remember both use a staging area and the existence of the Data Vault area will make it easier to construct the Dimensional model on top of that. Furthermore Data Vault modeling lends itself (as does Dimensional Modeling to some extent) to automation, which greatly reduces costs of building and maintaining the data warehouse solution.

Others complain about the extra storage that is needed to support this solution. Why store an extra copy of the source? Well I hope the described reasons for having a persistent staging area answer these questions. However you should of course take the extra storage costs into account when evaluating your choices. But please look at opportunity costs: Do you need an extra DBA or an extra SAN to support the extra layer?

A third complaint has to do with query performance. It is suggested that the extra joins introduced with Data Vault modeling will impact query performance. This of course depends on size, hardware, database and indexing strategy. But even if there is a significant impact in most cases it won’t have a major effect. After all we are only using these queries to populate the Dimensional model and this will only apply to new and changed rows.

Conclusion

Consider the combination of both methods when you are (re)building your data warehouse. In this blogpost I have given you some handles to evaluate this option. Do you need to keep control on data quality issues or do you want the agility to respond to changes quicker? Can you quantify these advantages? Compare them to the extra costs and make a better informed decision.

And of course you can cont(r)act me if you want any help in this evaluation.