IN2BI, Microsoft Business Intelligence
data, stories & insights
  • Wed, 14 Nov 12

    DWH Deck 2.0: Step 3 Publish the jobs to create the SQL Statements and BIML Files

    The DWH Deck is a simple tool that allows you to build and maintain your data warehouse better and faster.

    In previous blog posts we created connections and jobs that describe the source to target mapping. In this blog post we’ll examine the effects of publishing the jobs.

    • Click Publish on the menu bar
    • Check the created files in the default location: C:\in2bi\DWHDeck
    • Check the created tables and stored procedures in the NorthwindDW1 database (for example with the catalog query below).
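
    If you prefer to verify the results with a query instead of the Object Explorer, a minimal sketch like the one below will do (it assumes the NorthwindDW1 database used in this series):

    USE NorthwindDW1;
    -- List the tables and stored procedures that were just created, newest first.
    SELECT name, type_desc, create_date
    FROM sys.objects
    WHERE type IN ('U', 'P')   -- U = user table, P = stored procedure
    ORDER BY create_date DESC;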


    In this step we examined the effects of publishing the jobs: the SQL statements that are executed and the BIML files that are built.
    In the next post I'll describe how to create SSIS packages from these BIML files.

    Overview of the DWH Deck 2.0

  • Tue, 13 Nov 12

    DWH Deck 2.0: Step 4 Create SSIS packages from the created BIML Files

    The DWH Deck is a simple tool that allows you to build and maintain your data warehouse better and faster.

    In previous blog posts we created connections and jobs that describe the source to target mapping. In this step we will build SSIS packages from the published BIML files.

    To create SSIS packages from BIML files you need to install BIDS Helper, a free and popular add-in for SQL Server Data Tools (the successor of the Business Intelligence Development Studio).

    To create the SSIS Packages for the BIML files we created in step 3 we’ll need to:

    • In Visual Studio create a new Integration Services Project with the name: SSIS NorthwindDW1
    • Make sure the ProtectionLevel of the project is set to DontSaveSensitive. Right-click the project and select Properties to change this property.
    • Right-click SSIS NorthwindDW1 in the Solution Explorer and select Add, Existing Item…
    • Select all the files in C:\in2bi\DWHDeck\Create BIML Package using Shift-click and then click Add.
    • Select the BIML files under Miscellaneous in the Solution Explorer, right-click and select Generate SSIS Packages.

    Test the created packages (a quick row-count check to verify the results is sketched after this list):

    • To import the data using the SSIS Packages: Right-Click the MasterPackage.dtsx and select Execute Package
    • To import the data using the stored procedures: Right-Click the MasterProcedures.dtsx and select Execute Package
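
    To check that data actually arrived, you can count the rows per table. This is just a sketch against the NorthwindDW1 database, using the partition metadata; the table names depend on the jobs you defined:

    USE NorthwindDW1;
    -- Approximate row counts per table, taken from the partition metadata.
    SELECT s.name AS SchemaName,
           t.name AS TableName,
           SUM(p.rows) AS [RowCount]
    FROM sys.tables t
    JOIN sys.schemas s ON s.schema_id = t.schema_id
    JOIN sys.partitions p ON p.object_id = t.object_id
                         AND p.index_id IN (0, 1)   -- heap or clustered index only
    GROUP BY s.name, t.name
    ORDER BY s.name, t.name;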


    In steps 1–4 we looked at the basics of the DWH Deck. We created staging tables, stored procedures and SSIS packages to import the data from the source system.

    In the next blog post we will take an in-depth look at the templates.

    Overview of the DWH Deck 2.0

  • Thu, 12 Jul 12

    Things I Learn(ed) From My First Data Warehouse

    Some fifteen years ago I created my first data warehouse. This was after a decade in several financial functions like auditing, controlling and reporting.

    The experienced consultant who reviewed my data warehouse had a lot of comments. Why didn't I incorporate more attributes, he asked? There must be more information in the source systems… what about the age, gender, occupation and marital status of the people in this dimension?

    He was right of course, but as the business expert I didn't feel the need. I hadn't needed this information in my reports, and I had my doubts about the data quality of these attributes in the source systems.

    Learning from criticism

    But of course I learned from his advice, and in later projects I found myself convincing customers that they needed these attributes, or would probably need them in the future, and that not adding them now would cost much more in a later update.

    Revisiting my learned behavior

    So my team and I spent a lot of time extracting attributes, cleansing them and loading them into dimensions, spending a lot of time and the customer's money on attributes they didn't need in the first place. Nowadays I suggest a more agile approach: load the data as-is into a historical staging area or a data vault, and only cleanse the attributes and add them to the dimensional model when they are needed.

  • Mon, 04 Jun 12

    TSQL: MERGE With EXCEPT to Handle Type 2 Loading

    Type 2 loading is a pattern that is often used to incrementally load dimensions in a data warehouse. But you can also use it for loading satellites in a data vault or tables in a historical/persistent staging area. It consists of three parts:

    1. Identify new rows and add them to the target database.
    2. Identify changed rows and add them to the target database.
    3. End date the old rows in the target database that have been replaced with those of step 2.

     

    MERGE

    There are several ways to implement a Type 2 loading pattern. In this blog post I'll describe one using the TSQL statement MERGE. I wrote rather critically about the SQL MERGE statement earlier (in Dutch). Since then I have come to like it, and compared to other methods it seems to be a well-performing option for handling slowly changing dimensions. (See this study from Microsoft's Matt Mason.)

    Warren Thornthwaite from the Kimball Group gave a good explanation of using the SQL MERGE statement for slowly changing dimensions in Design Tip #107, Using the SQL MERGE Statement for Slowly Changing Dimension Processing.

    In his example for processing a type 2 dimension he uses a statement similar to the one below:

    INSERT INTO dimCustomer (AuditID, StartDate, EndDate, CustomerID, CustomerName, Adress, ..)
    SELECT @AuditID, GETDATE(), NULL, CustomerID, CustomerName, Adress, ..
    FROM
    ( MERGE DWH.dbo.dimCustomer tgt
      USING STG.dbo.Customer src
         ON src.CustomerID = tgt.CustomerID

      WHEN NOT MATCHED THEN
        INSERT VALUES (@AuditID, GETDATE(), NULL, src.CustomerID, src.CustomerName, src.Adress, ..)

      WHEN MATCHED
       AND tgt.EndDate IS NULL
       AND (   tgt.CustomerName <> src.CustomerName
            OR tgt.Adress <> src.Adress
            .. )
      THEN
        UPDATE SET tgt.EndDate = GETDATE()

      OUTPUT $action Action_Out, src.CustomerID, src.CustomerName, src.Adress, ..
    ) AS MERGE_OUT
    WHERE MERGE_OUT.Action_Out = 'UPDATE';

    The outcome of the MERGE statement is used as the source for the INSERT INTO statement at the top
    (the aforementioned step 2: identify changed rows and add them to the target database).
    Within the MERGE statement there are two actions:
    1.   Insert any new rows.
    2.   End date the rows in the target that have changed.

    NULL <> NULL

    A problem with this example is the implicit assumption that there are no NULL values stored in the compared columns. Any comparison that involves NULL evaluates to UNKNOWN rather than TRUE (even NULL <> NULL), so changes from or to a NULL value go undetected and rows that should be versioned are silently skipped.

    There are however two SQL set operators that treat NULL values as equal: INTERSECT and EXCEPT. So we can easily circumvent this problem by replacing the column comparisons with:

    AND EXISTS (SELECT  tgt.CustomerName, tgt.Adress …
    EXCEPT SELECT  src.CustomerName, src.Adress …)
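
    Put together, a minimal sketch of the WHEN MATCHED clause with this EXCEPT variant (using the same hypothetical dimCustomer columns as above) looks like this:

    WHEN MATCHED
     AND tgt.EndDate IS NULL
     AND EXISTS (SELECT tgt.CustomerName, tgt.Adress   -- current target values
                 EXCEPT
                 SELECT src.CustomerName, src.Adress)  -- incoming source values
    THEN
      UPDATE SET tgt.EndDate = GETDATE()

    Because EXCEPT treats two NULLs as equal, a row is only flagged as changed when at least one of the compared columns really differs.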

  • Mon, 14 May 12

    Four Approaches to Data Warehousing

    The requirements of a data warehouse solution change continually during its lifetime. For some organizations you can expect more changes than for others; think of maturity and the history of organizational changes: mergers, acquisitions, etc.

    You should adopt a solution type that will yield the best value taking these things into account.

    Data Warehouse Solutions

    When confronted with a new data warehouse solution that I am going to design or redesign I will typically choose between one of the following four approaches:

    1. One-Shot Data Warehouse
    2. Keeping History in Dimensions
    3. Historical / Persistent Staging Area
    4. Data Vault

    In this blog post I’ll briefly dive into these approaches and discuss some of the implementation issues.

    The choice for a specific approach depends on a number of circumstances, notably:

    • The expected total size of the solution
    • The number of source systems and how the information in them overlap
    • The average expected life-time of source systems: source system volatility
    • Current user requirements
    • Expected number of changes in user requirements: requirements volatility

     

    1. One-Shot Data Warehouse (small- and mid-sized solution)

    Current computing power typically makes it possible to build a small- or mid-sized data warehouse solution completely from scratch within the nightly maintenance window. This can be a very efficient solution with a high return on investment. Some people call it the destroy-and-rebuild approach, since you remove all previous data from the data warehouse before rebuilding it.

    An obvious disadvantage of this approach is that there is no track of changes in the source system: if a customer moves from London to Paris, all historical sales will be reported as Paris sales. Another disadvantage is the sheer amount of data that is transferred every night; this can accumulate to an amount that can no longer be loaded within the nightly maintenance window.

    2. Keeping History in Dimensions

    The problem of keeping track of history has been a major issue in data warehousing. In the theories of Ralph Kimball, Slowly Changing Dimensions play an important role. In his books he describes several patterns for handling history in dimensions.

    Type 1: Overwrite history

    For some attributes the business isn't interested in keeping historical information. This is most appropriate when correcting previous errors, or in other situations where there is no business value in keeping track of history, for example when a person's name changes due to marriage. The rows that were imported earlier into the dimension table are simply corrected with an UPDATE statement.

    Type 2: Keep history

    In this pattern history is preserved: a new record is entered for every change in a dimension attribute. The old row is marked as inactive and an end date is added to it. Any fact table records inserted after the change will link to the newly added row.
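
    As a minimal illustration (with a hypothetical dimCustomer table and columns), handling one changed customer could look like this:

    -- End date the currently active row for this customer ...
    UPDATE dbo.dimCustomer
    SET EndDate = GETDATE()
    WHERE CustomerID = @CustomerID
      AND EndDate IS NULL;

    -- ... and insert a new row that becomes the active version.
    INSERT INTO dbo.dimCustomer (CustomerID, CustomerName, City, StartDate, EndDate)
    VALUES (@CustomerID, @CustomerName, @City, GETDATE(), NULL);

    The post TSQL: MERGE With EXCEPT to Handle Type 2 Loading above shows a set-based version of the same pattern.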

    Other types

    Type 1 and Type 2 are the most commonly used. However there are some alternatives. They are described in this Wikipedia article.

    Other advantages

    Besides the tracking of history, another major advantage is that you only need the changed and new rows (the delta rows) from your source system, which reduces the time needed to load the data into your data warehouse. Getting only these delta rows can be challenging, however.

    3. Historical / Persistent Staging Area

    In this approach you typically take a copy of all relevant source tables and add temporal information in a 'StartDate' and an 'EndDate' column. Again you only process the delta rows, loading new and changed rows, and whenever a row has changed you end date the old row.
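
    A minimal sketch of such a table (with hypothetical names, mirroring a source Customer table):

    CREATE TABLE hstg.Customer (
        CustomerID   int           NOT NULL,  -- business key from the source system
        CustomerName nvarchar(100) NULL,      -- copied source attributes
        City         nvarchar(100) NULL,
        StartDate    datetime      NOT NULL,  -- when this version of the row was loaded
        EndDate      datetime      NULL       -- NULL = current version; set when a newer version arrives
    );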

    Based upon this historical staging area you can adopt method 1 or 2 to load the data in your dimensional model.

    What are the reasons for having a persistent staging area?

    In an earlier blog post I gave this answer:

    Well… first of all, it could be a demand from auditors or data governance initiatives, possibly driven by external regulations (Sarbanes-Oxley, Basel I, Basel II, HIPAA, …).

    A second reason, which can be tied to the first, has to do with keeping control of and reporting on data quality issues in the source systems, and thus:

    • Identify possible improvements in processes. (e.g. same data entered twice)
    • Increase / decrease confidence in the data, information and decisions

    The third reason has to do with agility: the ability to respond to changes rapidly. Small changes like switching a dimension attribute from type 1 to type 2 or adding additional attributes to a dimension, but also large changes like big organizational changes, mergers or new ERP implementations. With a persistent staging area it is far easier to respond to these changes and rebuild the data marts (with history), resulting in quicker, better reporting after such a change.

    4. Data Vault

    The Data Vault is a special implementation of a persistent staging area. Source tables are split by column aspect, separating the structural items (business keys and the associations between business keys) from the descriptive attributes.

    And thus the same advantages apply to the Data Vault as well. The power of the Data Vault comes from the integration on business keys. This approach is therefore especially suitable if you have source systems with overlapping information and/or if source systems change frequently over time.

    This wiki article by Ronald Kunenberg gives a nice overview of the Data Vault Methodology with a lot of additional references.

    Conclusion

    In this article I discussed some aspects that you should consider when you choose an approach for (re)designing your Data Warehouse. This table summarizes these aspects:

    Aspect / Approach (1 = One-Shot Data Warehouse, 2 = Keeping History in Dimensions, 3 = Historical / Persistent Staging Area, 4 = Data Vault):

    • Simple, fast solution: 1
    • Only process delta rows: 2, 3, 4
    • Keep track of historical information: 2, 3, 4
    • Solve auditing / data governance issues: 3, 4
    • Control over data quality in source systems: 3, 4
    • Easily adapt to changing user requirements: 3, 4
    • Source systems with overlapping information: 4
    • Frequently changing source systems: 4
  • Wed, 02 May 12

    Batch File To Start/Stop SQL Server

    I figured this out a few years ago but forgot most of it. With a new blazing fast laptop and only one instance of SQL Server, running these services automatically at start-up wasn't a big problem.

    But now, after installing SQL 2012, I felt the need to set these services to manual start-up, and I wrote four batch files:

    • Start_SQL2008R2.bat
    • Stop_SQL2008R2.bat
    • Start_SQL2012.bat
    • Stop_SQL2012.bat

    Whenever I need one of these instances I right-click the relevant .bat file and run it as administrator.

    This is the contents of Start_SQL2008R2.bat, which starts my SQL Server instance named SQL2008R2:

    net start MSSQL$SQL2008R2
    net start SQLAgent$SQL2008R2
    net start MSOLAP$SQL2008R2
    net start MsDtsServer100
    net start ReportServer$SQL2008R2
    pause

    And this is the contents of my Stop_SQL2008R2.bat file.

    net stop SQLAgent$SQL2008R2
    net stop MSSQL$SQL2008R2
    net stop MSOLAP$SQL2008R2
    net stop MsDtsServer100
    net stop ReportServer$SQL2008R2
    pause

    Notice the order: stop the SQL Server Agent before stopping SQL Server.

  • Fri, 13 Apr 12

    SSIS Pattern: Drop and Rebuild Indexes Dynamically

    Fact and dimension tables in our data warehouses are queried often. The queries used in our reports and the queries used for processing SSAS cubes create a large workload of SELECT queries, and these queries return results faster with appropriate indexes.

    However, a lot of indexes are a pain when you load data into your data warehouse. A good practice is therefore to drop your NONCLUSTERED indexes at the start of loading your dimension and fact tables and to rebuild them after loading has finished.

    The pain of your DBA

    Your DBA will typically change these indexes periodically, based on performance tuning of the latest workload. He will be very disappointed when, after changing these indexes, he is confronted with the old ones after the next rebuild.

    So we need some way to:

    • script the currently used indexes
    • store them
    • drop them
    • load the dimensions
    • load our fact tables
    • rebuild the indexes again.

    In this blog post I walk you through the creation of the package with this control flow:

    But first let's focus on the SQL script we need to capture the current indexes. Unfortunately you cannot use the function OBJECT_DEFINITION to generate the code needed to recreate an index, because indexes are not supported by it. So we have to get creative.
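
    As a quick illustration (with a hypothetical procedure name):

    -- OBJECT_DEFINITION returns the source text of programmable objects such as
    -- stored procedures, views and functions:
    SELECT OBJECT_DEFINITION(OBJECT_ID(N'dbo.uspLoadDimCustomer'));
    -- Indexes, however, are not listed in sys.objects, so there is no definition
    -- to retrieve; the CREATE INDEX statements have to be reassembled from the
    -- catalog views sys.indexes and sys.index_columns instead.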

    The next SQL statement is based on this blog post by Simon Sabin. It returns a dataset with 5 columns:

    • SchemaName
    • ObjectName
    • IndexName
    • DropIndexStatement
    • CreateIndexStatement


    The SQL Statement to return this list of indexes:


    SELECT SchemaName = SCHEMA_NAME(o.schema_id)
         , ObjectName = o.name
         , IndexName  = i.name
         , DropIndexStatement = 'DROP INDEX [' + i.name + '] ON ['
             + SCHEMA_NAME(o.schema_id) + '].[' + o.name + ']'
         , CreateIndexStatement = 'CREATE '
             + (CASE i.is_unique WHEN 1 THEN 'UNIQUE ' ELSE '' END)
             + 'NONCLUSTERED INDEX [' + i.name + '] ON ['
             + SCHEMA_NAME(o.schema_id) + '].[' + o.name + '] ( '
             + LEFT(list, ISNULL(splitter - 1, LEN(list))) + ' ) '
             + ISNULL('INCLUDE ( ' + SUBSTRING(list, indCol.splitter + 1, 100) + ' ) ', '')
             + ISNULL('WHERE ' + i.filter_definition, '')
    FROM sys.indexes i
    JOIN sys.objects o ON i.object_id = o.object_id
    CROSS APPLY
        (SELECT splitter = NULLIF(CHARINDEX('|', indexCols.list), 0)
              , list
         FROM (SELECT list = CAST((
                   SELECT CASE WHEN sc.is_included_column = 1 AND sc.ColPos = 1 THEN '|' ELSE '' END
                        + CASE WHEN sc.ColPos > 1 THEN ', ' ELSE '' END
                        + name
                        + CASE WHEN sc.is_descending_key = 1 THEN ' DESC' ELSE '' END
                   FROM (SELECT sc.is_descending_key
                              , sc.is_included_column
                              , index_column_id
                              , name = '[' + name + ']'
                              , ColPos = ROW_NUMBER() OVER (PARTITION BY sc.is_included_column
                                                            ORDER BY sc.index_column_id)
                         FROM sys.index_columns sc
                         JOIN sys.columns c ON sc.object_id = c.object_id
                                           AND sc.column_id = c.column_id
                         WHERE sc.index_id = i.index_id
                           AND sc.object_id = i.object_id) sc
                   ORDER BY sc.is_included_column, ColPos
                   FOR XML PATH(''), TYPE) AS VARCHAR(MAX))
              ) indexCols
        ) indCol
    WHERE i.is_primary_key = 0
      AND i.is_unique_constraint = 0
      AND i.type_desc = N'NONCLUSTERED'
      AND o.type = N'U'
    ORDER BY SchemaName, ObjectName, IndexName

    The Steps to change/create the master package:

    1. Open the master package that you use to load the dimensions and facts.

    2. Add Variables:

    • IndexList with Data Type=Object
    • DropIndexStatement with Data Type =String
    • CreateIndexStatement with Data Type =String


    3. Add an Execute SQL Task to the Control Flow

    • Change Name to Get IndexList
    • Change ResultSet to Full result set
    • Select the target connection under Connection
    • Copy the SQL statement (see above) as the SQLStatement

     

    • Select the tab Result Set
    • Choose Add and use 0 as the Result Name and Select User::IndexList as the Variable Name 


    4. Add a Foreach Loop Container to the Control Flow and double click on it to open the Foreach Loop Editor

    • Under the tab General change Name and Description to Drop each index
    • Under the tab Collection change Enumerator to Foreach ADO Enumerator and select User::IndexList from the dropdown box as the ADO Object source variable

     

    • Under the tab Variable Mappings select User::DropIndexStatement and type 3 as Index


    5. Drop an Execute SQL Task in the Drop each index container and double click on it to open the Execute SQL Task Editor.

    • Change Name and Description to Drop Index
    • Select the target connection under Connection
    • Select Variable as the SQLSourceType
    • Select User::DropIndexStatement as the SourceVariable


    6. Add a Foreach Loop Container to the Control Flow and double click on it to open the Foreach Loop Editor

    • Use Create each index as name for the Foreach loop container
    • In the Collection tab of the Foreach Loop Editor select Foreach ADO Enumerator and use the User::IndexList as ADO Object source variable
    • In the Variable Mappings tab Select User::CreateIndexStatement as variable and 4 as index


    7. Drop an Execute SQL Task in the Create each index container and double click on it to open the Execute SQL Task Editor.

    • Change Name and Description to Create Index
    • Select your Connection under Connection
    • Select Variable as the SQLSourceType
    • Select User::CreateIndexStatement as the SourceVariable


    8. Change the precedence constraints to match the aforementioned control flow.

  • Thu, 03 Nov 11

    SQL Server 2012 Editions and Licensing


    Microsoft published the licensing options and prices. Two aspects stand out:

    • A new edition: Business Intelligence (between Standard- and Enterprise Edition)
    • A move to core based licensing (from processor based)

     

    The main editions and the differences in functionality:


    Standard Edition:

    • Basic OLTP
    • Basic Reporting & Analytics
    • Programmability
    • Manageability
    • Basic High Availability


    Business Intelligence Edition:

    Standard Edition +:

    • Data Quality Services
    • Master Data Services
    • Power View
    • PowerPivot for SharePoint Portal Server
    • Semantic model, advanced analytics


    Enterprise Edition:

    Business Intelligence Edition +:

    • Advanced Security
    • Advanced High availability
    • ColumnStore indexing
    • Compression
    • Partitioning

     

    US Pricing:

    This is an overview of the expected prices:


  • Mon, 24 Oct 11

    SQL Server 2012 (Denali)

    Microsoft has made significant Business Intelligence investments in SQL Server since 2000.

    The next (2012) release will also include many new BI features. In the next posts I will explore some of these features. This post will act as an overview of / index to these articles:

    • …..