IN2BI, Microsoft Business Intelligence
data, stories & insights
rss | email | twitter
  • Mon
    14
    May 12

    Four Approaches to Data Warehousing

    The requirements of a Data Warehouse solution continually change during its life time. For some organizations you’ll typically expect more changes than others. Think of maturity, history of organizational changes: mergers, acquisitions etc.

    You should adopt a solution type that will yield the best value taking these things into account.

    Data Warehouse Solutions

    When confronted with a new data warehouse solution that I am going to design or redesign I will typically choose between one of the following four approaches:

    1. One-Shot Data Warehouse
    2. Keeping History in Dimensions
    3. Historical / Persistent Staging Area
    4. Data Vault

    In this blog post I’ll briefly dive into these approaches and discuss some of the implementation issues.

    The choice for a specific approach depends on a number of circumstances, notably:

    • The expected total size of the solution
    • The number of source systems and how the information in them overlap
    • The average expected life-time of source systems: source system volatility
    • Current user requirements
    • Expected number of changes in user requirements: requirements volatility

     

    1. One-Shot Data Warehouse (small- and mid-sized solution)

    gunCurrent computing possibilities will typically enable the complete build (from scratch) of a data warehouse solution for small- and mid-sized solutions within the typically nightly maintenance window. This can be a very efficient solution with a high return on investment. Some people call it the destroy and rebuild approach since you are removing all previous data from the data warehouse before rebuilding it.

    An obvious disadvantage of this approach is that there is no track of changes in the source system: If a customer moves from London to Paris all the historical sales will be reported as Paris sales. Another disadvantage is the sheer amount of data that is transferred every night. That can accumulate to an amount that can not been loaded in the nightly maintenance window.

    2. Keeping History in Dimensions

    The problem of keeping track of history has been a major issue in data warehousing. In the theories  by Ralph Kimball Slowly Changing Dimensions play an import role. In his books he mentioned some patterns to handle history in dimensions.

    Type 1: Change history

    For some attributes the business isn’t interested in keeping historical information. This of course is most appropriate when correcting previous errors. Or in other situations where there is no business value in keeping track of historical information. For example when a person’s name changes due to marriage. The rows that were imported earlier in the dimension table are corrected by an UPDATE statement.

    Type 2: Keep history

    In this pattern history is preserved because a new record is entered for every change in a dimension attribute. The old row is marked as inactive and an end date is added to the row. Any new fact tables records that are inserted after this dimension row is changed will link to the newly added row.

    Other types

    Type 1 and Type 2 are the most commonly used. However there are some alternatives. They are described in this Wikipedia article.

    Other advantages

    Besides the tracking of history another major advantage is that you’ll only need changed and new rows (the delta rows) from your source system. And thus diminishing the time you need to load the data in your data warehouse. However getting only these delta rows can be challenging.

    3. Historical / Persistent Staging Area

    In this approach you’ll typically get a copy of all relevant source tables and add temporal information in a ‘StartDate’ and a ‘EndDate’ column. Again you’ll only process the delta rows: loading new and changed rows. And whenever a row has changed you’ll end date the old row.

    Based upon this historical staging area you can adopt method 1 or 2 to load the data in your dimensional model.

    What are the reasons for having a persistent staging area?

    In an earlier blog post I gave this answer:

    Well …first of all it could be a demand from auditors or data governance initiatives. Possible driven by external regulations.(Sarbanes-Oxley, Basel I, Basel II, HIPAA, ..)

    A second reason which can be tied to the first has to do with keeping control of / reporting on data quality issues in the source systems and thus:

    • Identify possible improvements in processes. (e.g. same data entered twice)
    • Increase / decrease confidence in the data, information and decisions

    The third reason has to do with agility. The ability to respond to changes rapidly. Small changes like changing a dimension attribute from type 1 to 2 or adding additional attributes to a dimension. But also large changes like big organizational changes, mergers, new ERP implementations. By having a persistent staging area  it’ll be far easier to respond to these changes and rebuild the data marts (with history). Resulting in quicker, better reporting after such a change.

    4. Data Vault

    The Data Vault is a special implementation of a persistent staging area. Source tables are split based on column aspects separating the structural items (business keys and the associations between the business keys) from the descriptive attributes.

    And thus the same advantages apply to the Data Vault as well. The power of the Data Vault comes from the integration on business keys. This approach is therefore especially suitable if you have source systems with overlapping information and/or if source systems change frequently over time.

    This wiki article by Ronald Kunenberg gives a nice overview of the Data Vault Methodology with a lot of additional references.

    Conclusion

    In this article I discussed some aspects that you should consider when you choose an approach for (re)designing your Data Warehouse. This table summarizes these aspects:

    Aspect /  Approach: 1 2 3 4
    Simple, fast solution x      
    Only process delta rows   x x x
    Keep track of historical information   x x x
    Solve auditing / data governance issues     x x
    Control over data quality in source systems     x x
    Easily adapt to changing user requirements     x x
    Source systems with overlapping information       x
    Frequently changing source systems       x
  • Wed
    02
    May 12

    Batch File To Start/Stop SQL Server

    I figured this out a few years ago but forgot most of it. With a new blazing fast lap-top and only one instance of SQL Server running these services automatically at start-up wasn’t a big problem.

    But now after installing SQL2012 I felt the need to set these services to manually start-up and wrote four batch-files:

    • Start_SQL2008R2.bat
    • Stop_SQL2008R2.bat
    • Start_SQL2012.bat
    • Stop_SQL2012.bat

    Whenever I need one of these instances I right-click the relevant bat file and Run them as administrator.

    This is the contents of Start_SQL2008R2.bat to start my SQL Server instanced with the name SQL2008R2.:

    net start MSSQL$SQL2008R2
    net start SQLAgent$SQL2008R2
    net start MSOLAP$SQL2008R2
    net start MsDtsServer100
    net start ReportServer$SQL2008R2
    pause

    And this is the contents of my Stop_SQL2008R2.bat file.

    net stop SQLAgent$SQL2008R2
    net stop MSSQL$SQL2008R2
    net stop MSOLAP$SQL2008R2
    net stop MsDtsServer100
    net stop ReportServer$SQL2008R2
    pause

    Notice the order (Stop SQL Agent now before SQL Server)

  • Fri
    13
    Apr 12

    SSIS Pattern: Drop- and rebuild indexes dynamically

    Fact- and dimension tables in our data warehouses are queried often. The queries used in our reports and the queries used for processing SSAS cubes create a large workload of select queries. These “select” queries benefit from indexes and will return results faster with appropriate indexes.

    However a lot of indexes are a pain when you load the date into your data warehouse. A good practices is therefore to drop your NONCLUSTERED indexes at the start of loading your dimensions and fact tables and rebuilt them after loading has finished.

    The pain of your DBA

    Your DBA will typically change these indexes periodically based on his performance tuning efforts of the latest workload. He will be very disappointed when after changing these indexes he is confronted with the old indexes after a rebuild of indexes.

    clip_image018So we need some way to:

    • script the currently used indexes
    • store them
    • drop them
    • load the dimensions
    • load our fact tables
    • rebuild the indexes again.

    In this blog post I walk you to the creation of the package with this control flow:

    But first let’s focus on the SQL script we need to capture the current indexes. Unfortunately you can not use the SQL Function OBJECT_DEFINITION to generate the code necessary to create an index since that object isn’t supported. So we have to get creative.

    The next SQL Statement is based on this blogpost by Simon Sabin. It will return a dataset with 5 columns:

    • SchemaName
    • ObjectName
    • IndexName
    • DropIndexStatement
    • CreateIndexStatement

    clip_image002

    The SQL Statement to return this list of indexes:


    SELECT SchemaName = SCHEMA_NAME (o.SCHEMA_ID)
       ,
    ObjectName = o.name
      
    ,IndexName = i.name
      
    ,DropIndexStatement = 'DROP INDEX [' + i.Name + '] ON [' + SCHEMA_NAME (o.SCHEMA_ID)
       +
    '].[' + o.name + ']'
      
    ,CreateIndexStatement = 'CREATE ' + (CASE i.is_unique WHEN 1 THEN 'UNIQUE ' ELSE '' END)
       +
    'NONCLUSTERED INDEX ['
      
    + i.name + '] ON [' + SCHEMA_NAME (o.SCHEMA_ID) + '].[' + o.name + '] ( '
      
    + LEFT(list, ISNULL(splitter-1,LEN(list))) + ' ) '
      
    + ISNULL('INCLUDE ( ' + SUBSTRING(list, indCol.splitter +1, 100) + ' ) ','')
       +
    ISNULL('WHERE ' + i.filter_definition,'')
    FROM sys.indexes i
    JOIN sys.objects o ON i.OBJECT_ID = o.OBJECT_ID
    CROSS APPLY
    (SELECT splitter = NULLIF(CHARINDEX('|',indexCols.list),0)
    ,
    list
    FROM (SELECT list=CAST((
      
    SELECT CASE WHEN sc.is_included_column = 1 AND sc.ColPos = 1 THEN '|' ELSE '' END +
        
    CASE WHEN sc.ColPos  > 1 THEN ', ' ELSE '' END + name +
               
    CASE WHEN sc.is_descending_key=1 THEN ' DESC' ELSE '' END
                FROM
    (SELECT sc.is_descending_key
          
    , sc.is_included_column
          
    , index_column_id
          
    , name = '[' + name + ']'
          
    , ColPos = ROW_NUMBER() OVER (PARTITION BY sc.is_included_column
              
    ORDER BY sc.index_column_id)
            
    FROM sys.index_columns  sc
            
    JOIN sys.columns c ON sc.OBJECT_ID = c.OBJECT_ID
              
    AND sc.column_id = c.column_id
            
    WHERE sc.index_id = i.index_id
                       
    AND sc.OBJECT_ID = i.OBJECT_ID ) sc
                       
    ORDER BY sc.is_included_column, ColPos
                         
    FOR XML PATH (''), TYPE) AS VARCHAR(MAX))
       )
    indexCols
    ) indCol
    WHERE i.is_primary_key=0
    AND i.is_unique_constraint=0
    AND i.type_desc=N'NONCLUSTERED'
    AND o.TYPE=N'U'
    ORDER BY SchemaName, ObjectName, IndexName

    The Steps to change/create the master package:

    1. Open the master package that you use to load the dimensions and facts.

    2. Add Variables:

    • IndexList with Data Type=Object
    • DropIndexStatement with Data Type =String
    • CreateIndexStatement with Data Type =String
      clip_image004


    3. Add an Execute SQL Task to the Control Flow

    • Change Name to Get IndexList
    • Change ResultSet to Full result set
    • Select the target connection under Connection
    • Copy the SQL statement (see above) as the SQLStatement
      clip_image006

     

    • Select the tab Result Set
    • Choose Add and use 0 as the Result Name and Select User::IndexList as the Variable Name 
      clip_image008


    4. Add a Foreach Loop Container to the Control Flow and double click on it to open the Foreach Loop Editor

    • Under the tab General change Name and Description to Drop each index
    • Under the tab Collection change Enumerator to Foreach ADO Enumerator and select User::IndexList from the dropdown box as the ADO Object source variable clip_image010

     

    • Under the tab Variable Mappings select User::DropIndexStatement and type 3 as Indexclip_image012


    5. Drop an Execute SQL Task in the Drop each index container and double click on it to open the Execute SQL Task Editor.

    • Change Name and Description to Drop Index
    • Select the target connection under Connection
    • Select Variable as the SQLSourceType
    • Select User::DropIndexStatement as the SourceVariable clip_image014


    6. Add a Foreach Loop Container to the Control Flow and double click on it to open the Foreach Loop Editor

    • Use Create each index as name for the Foreach loop container
    • In the Collection tab of the Foreach Loop Editor select Foreach ADO Enumerator and use the User::IndexList as ADO Object source variable
    • In the Variable Mappings tab Select User::CreateIndexStatement as variable and 4 as index


    7. Drop an Execute SQL Task in the Drop each index container and double click on it to open the Execute SQL Task Editor.

    • Change Name and Description to Create Index
    • Select your Connection under Connection
    • Select Variable as the SQLSourceType
    • Select User::CreateIndexStatement as the SourceVariableclip_image016


    8. Change the Precedence Constraints to map the example of the afore mentioned control flow.

  • Thu
    03
    Nov 11

    SQL Server 2012 Editions and Licensing

    image

    Microsoft published the licensing options and prices. Two aspects stand out:

    • A new edition: Business Intelligence (between Standard- and Enterprise Edition)
    • A move to core based licensing (from processor based)

    The main editions and the differences in functionality:


    Standard Edition:

    • Basic OLTP
    • Basic Reporting & Analytics
    • Programmability
    • Manageability
    • Basic High Availability


    Business Intelligence Edition:

    Standard Edition +:

    • Data Quality Services
    • Master Data Services
    • Power View
    • PowerPivot for Sharepoint Portal Server
    • Semantic model, advanced analytics


    Enterprise Edition

    Business Intelligence Edition +:

    • Advanced Security
    • Advanced High availability
    • ColumnStore indexing
    • Compression
    • Partitioning

     

    US Pricing:

    This is an overview of the expected prices:

    image

  • Mon
    24
    Oct 11

    SQL Server 2012 (Denali)

    Microsoft made significant Business Intelligence investments in SQL Server since 2000.

    The next (2012) release will also include many new BI features. In the next posts I will explore some of these features. This post will act as an overview of / index to these articles:

    • …..
  • Sat
    22
    Oct 11

    Sankey Diagrams in Google Analytics

    According to Visualization Expert Edward Tufte This Sankey diagram is “Probably the best statistical graphic ever drawn”

    image

    It tells the story of the losses suffered by Napoleon's army in the Russian campaign of 1812. Beginning at the Polish-Russian border, the thick band shows the size of the army at each position. The path of Napoleon's retreat from Moscow in the bitterly cold winter is depicted by the dark lower band.

    Google recently announced an addition to Google Analytics that uses the same visualization technique to show you how your users move across you site. The call it Flow visualization.

    Some pictures:

    image

    Browser – Product Catalog – Shopping Cart – Login – Order

    image

    Country (France selected) - Product Catalog – Shopping Cart – Login – Order

    A very nice addition with which we (website owners) get to play within the next weeks… Looking forward to that.

  • Mon
    17
    Oct 11

    TSQL Metadata Queries

    When you want to query SQL Server for the metadata you basically have two options:

    • Use the Catalog views
    • Use the Information Schema Views

    The last one are based on catalog view definitions in the ISO standard. You typically use them if you need to support more database systems.

    The first is one is recommended by Microsoft.

    In this blogpost I recapitulate some queries that I have used in the past and will probably use in the future.

    1. Get column information from all tables

    -----------------------------------------------------------
    -- Query: GetColumnInfoFromAllTables
    -- By Marco Schreuder 27-09-2011
    -----------------------------------------------------------
    SELECT SchemaName = SCHEMA_NAME(T.schema_id)
       ,
    TableName = T.name
      
    ,ColumnName = C.name
      
    ,ColumnID = C.column_id
      
    ,DataType = ty.name
      
    ,MaxLength = C.max_length
      
    ,PRECISION = C.PRECISION
      
    ,Scale = C.scale
      
    ,CondensedDataType=(CASE
          
    WHEN (ty.name IN ('char','nchar','varchar','nvarchar') AND C.max_length=-1)
                  THEN ty.name + '(MAX)'
          
    WHEN (ty.name IN ('char','nchar','varchar','nvarchar') AND C.max_length>0)
                 THEN ty.name + '(' + CONVERT(VARCHAR(20),C.max_length) + ')'
          
    WHEN ty.name IN ('decimal','numeric')
                 THEN ty.name + '(' + CONVERT(VARCHAR(20),C.PRECISION) + ',' +
                      
    CONVERT(VARCHAR(20),C.scale) + ')'
          
    WHEN ty.name IN ('datetime2','datetimeoffset')
                
    THEN ty.name + '(' + CONVERT(VARCHAR(20),C.scale) + ')'
          
    ELSE ty.name
          
    END )
       ,
    IsIdentity = c.is_identity
      
    ,IsNullable = c.is_nullable
      
    ,ModifiedDate = T.modify_date
    FROM sys.tables T
    INNER JOIN sys.columns C ON C.OBJECT_ID=T.OBJECT_ID
    INNER JOIN sys.types ty ON C.system_type_id=ty.user_type_id
    ORDER BY SCHEMA_NAME(T.schema_id),T.Name,C.column_id

    This will return this result set:

    image

    Notice the join between sys.types and sys.columns on type.user_type_id. Otherwise you would get duplicate rows if there are user types defined.

    2. Get column information from all views

    Almost the same as 1. Replacing sys.tables with sys.views:

    -----------------------------------------------------------
    -- Query: GetColumnInfoFromAllViews
    -- By Marco Schreuder 27-09-2011
    -----------------------------------------------------------
    SELECT SchemaName = SCHEMA_NAME(V.schema_id)
       ,
    ViewName = V.name
      
    ,ColumnName = C.name
      
    ,ColumnID = C.column_id
      
    ,DataType = ty.name
      
    ,MaxLength = C.max_length
      
    ,PRECISION = C.PRECISION
      
    ,Scale = C.scale
      
    ,CondensedDataType=(CASE
          
    WHEN (ty.name IN ('char','nchar','varchar','nvarchar') AND C.max_length=-1)
                
    THEN ty.name + '(MAX)'
          
    WHEN (ty.name IN ('char','nchar','varchar','nvarchar') AND C.max_length>0)
                
    THEN ty.name + '(' + CONVERT(VARCHAR(20),C.max_length) + ')'
          
    WHEN ty.name IN ('decimal','numeric') THEN ty.name + '(' + CONVERT(
                
    VARCHAR(20),C.PRECISION) + ',' + CONVERT(VARCHAR(20),C.scale) + ')'
          
    WHEN ty.name IN ('datetime2','datetimeoffset')
                
    THEN ty.name + '(' + CONVERT(VARCHAR(20),C.scale) + ')'
          
    ELSE ty.name
          
    END )
       ,
    IsIdentity = c.is_identity
      
    ,IsNullable = c.is_nullable
      
    ,ModifiedDate = V.modify_date
    FROM sys.views V
    INNER JOIN sys.columns C ON C.OBJECT_ID=V.OBJECT_ID
    INNER JOIN sys.types ty ON C.system_type_id=ty.user_type_id
    ORDER BY V.Name,C.column_id

    Resulting in:

    image

    3. Get details of foreign key constraints

    -----------------------------------------------------------
    -- Query: GetFKDetails
    -- By Marco Schreuder 27-09-2011
    -----------------------------------------------------------
    SELECT FKName = f.name
      
    ,FKSchema = OBJECT_SCHEMA_NAME(f.parent_object_id)
       ,
    FKTable = OBJECT_NAME(f.parent_object_id)
       ,
    FKColumn = COL_NAME(f.parent_object_id,fc.parent_column_id)
       ,
    RefSchema = OBJECT_SCHEMA_NAME(f.referenced_object_id)
       ,
    RefTable = OBJECT_NAME(f.referenced_object_id)
       ,
    RefColumn = COL_NAME(fc.referenced_object_id,fc.referenced_column_id)
       ,
    ConstraintColumnID = fc.constraint_column_id
      
    ,ModifiedDate = f.modify_date
    FROM sys.foreign_keys f
    INNER JOIN sys.foreign_key_columns fc
      
    ON f.OBJECT_ID = fc.constraint_object_id

    returns:

    image

    4. Get details of the indices created on user tables

    -----------------------------------------------------------
    -- Query: GetIndexDetails
    -- By Marco Schreuder 27-09-2011
    -----------------------------------------------------------
    SELECT IndexName = i.name
      
    ,SchemaName = OBJECT_SCHEMA_NAME(i.OBJECT_ID)
       ,
    ObjectName = OBJECT_NAME(i.OBJECT_ID)
       ,
    ColumnName = COL_NAME(i.OBJECT_ID,ic.column_id)
       ,
    IndexType = i.type_desc
      
    ,IndexID = i.index_id
      
    ,IndexColumnID = ic.index_column_id
      
    ,KeyOrdinal = ic.key_ordinal
      
    ,IsIncludedColumn = ic.is_included_column
      
    ,IsUnique = i.is_unique
      
    ,IsPrimaryKey = i.is_primary_key
    FROM sys.indexes i
    INNER JOIN sys.index_columns ic
      
    ON i.OBJECT_ID = ic.OBJECT_ID AND i.index_id=ic.index_id
    --limit to indexes on user tables
    INNER JOIN sys.tables t
      
    ON i.OBJECT_ID = t.OBJECT_ID

    which returns:

    image

  • Wed
    12
    Oct 11

    Biml versus CodeSmith and MyGeneration

    After a recent conference a participant pointed me to CodeSmith and MyGeneration as alternatives for BimlScript. And although I don’t have any experience with these tools from their website I understand he is right. You could use these tools in a similar fashion.

    Both are generic code generators and given a document with a certain xml or other format they’ll be able to loop through a dataset and inject values to the original document. So you could use a certain ssis package… Investigate the xml… Add placeholders… Loop through a dataset … replace the placeholders in the ssis package with the information from your dataset … and save the newly created documents.

    One important disadvantage of this approach is the scrutiny of the rather complex xml in an ssis package. Biml offers an easier/cleaner way to describe a package because it doesn’t need to hold design aspect of the BIDS interface of the package.

    An example of Biml with BimlScript:

    image

  • Tue
    11
    Oct 11

    Data Vault Automation Conference

    in2bi was one of the sponsors of the Data Vault Automation Conference which was held in Utrecht on October 6th.

    Data Vault has become pretty popular in the Netherlands the last years. Due to the rigid modular approach of Data Vault modeling source data can easily be divided in the three standard table types with standard load mechanisms which lead to easy automation.

    Both aspects contribute to a lot of innovation in the Netherlands with respect to data warehouse automation.

    During the conference more than 100 participants where kept up to date by:

    You can download the slides of my session by clicking on the powerpoint logo.PowerPoint

  • Mon
    10
    Oct 11

    SQL Bits 2011

    SQLBitsLogo

    From 29.09.2011 until 01.10.2011 I visited SQL Bits 9 in Liverpool. I had a great time: beautiful weather, well organized event and some great sessions… and I met some great- and nice people.

    In my own session “Creating a Meta Data Driven SSIS Solution with Biml” I tried to cover to much ground. I’ll try not to make that mistake again and focus more in depth on fewer aspects next time.

    You can download the slides of my session by clicking on the powerpoint logo.PowerPoint