A lot of emphasis on the Internet lately has been placed on the difference between both methods. This blogpost from Dan Linstedt and this discussion on the Kimball forums are examples of this strange focus.
A focus I don’t totally understand since both techniques serve different purposes and can happily live together in most data warehouse solutions:
- Data Vault is all about the back-room, efficiently collecting, integrating and preserving data from the source systems.
- Dimensional modeling is all about the front-room, publishing the organization’s data assets to effectively support decision making.
From a Dimensional modeling (Kimball) perspective Data Vault is an implementation of a persistent staging area. Problems that some people have with this view/name seem to boil down to the definition of what a data warehouse is. I personally don’t have a problem with this name nor do I consider it to be less important. And yes it will qualify as a data warehouse.
Furthermore Dimensional modeling is generally acknowledged to be the best way to publish data to the business users and has a wide-spread adoption. A focus for further acceptance of Data Vault could be that it is the best implementation of a persistent staging area.
So let’s try to make this case … but first:
What are the reasons for having a persistent staging area?
A second reason which can be tied to the first has to do with keeping control of / reporting on data quality issues in the source systems and thus:
- identify possible improvements in processes. (e.g. same data entered twice)
- increase / decrease confidence in the data, information and decisions
The third reason has to do with agility. The ability to respond to changes rapidly. Small changes like changing a dimension attribute from type 1 to 2 or adding additional attributes to a dimension. But also large changes like big organizational changes, mergers, new ERP implementations. By having a persistent staging area (preferably integrated on business keys) it’ll be far easier to respond to these changes and rebuild the data marts (with history). Resulting in quicker, better reporting after such a change.
Of course you need to balance these advantages against the extra costs. Before we take a look at these extra costs and other raised disadvantages let’s examine some …
Similarities between Data Vault and Dimensional modeling
Let’s first focus on the methodology of both approaches. They both advocate a bottom-up iterative implementation: Identify scope, select business area, build the solution and .. focus on a new iteration.
Both use a limited number of strict table types with a well-defined function: Dimension- and fact tables in Dimensional modeling and hub-, link- and satellite tables in Data Vault modeling. Where a hub tables and its satellites typically roll-up into a dimension and a link tables and its satellites typically roll-up into a fact table.
Another “fit” is the use of a volatile staging area as a synchronization- and recovery point.
Focus on the extra costs
Some would argue that since you are building two data warehouses you’ll end up with twice the costs. That’s seems far fetched … remember both use a staging area and the existence of the Data Vault area will make it easier to construct the Dimensional model on top of that. Furthermore Data Vault modeling lends itself (as does Dimensional Modeling to some extent) to automation, which greatly reduces costs of building and maintaining the data warehouse solution.
Others complain about the extra storage that is needed to support this solution. Why store an extra copy of the source? Well I hope the described reasons for having a persistent staging area answer these questions. However you should of course take the extra storage costs into account when evaluating your choices. But please look at opportunity costs: Do you need an extra DBA or an extra SAN to support the extra layer?
A third complaint has to do with query performance. It is suggested that the extra joins introduced with Data Vault modeling will impact query performance. This of course depends on size, hardware, database and indexing strategy. But even if there is a significant impact in most cases it won’t have a major effect. After all we are only using these queries to populate the Dimensional model and this will only apply to new and changed rows.
Consider the combination of both methods when you are (re)building your data warehouse. In this blogpost I have given you some handles to evaluate this option. Do you need to keep control on data quality issues or do you want the agility to respond to changes quicker? Can you quantify these advantages? Compare them to the extra costs and make a better informed decision.
And of course you can cont(r)act me if you want any help in this evaluation.