I’ve been focusing a lot of my study time on data warehousing lately. I’ve been supporting the system and storage of data warehouses for a while but lately have been digging into the developer topics.
Dimensions are the tables we design to make data look good in a pivot chart. They are the tables that describe our facts. Customer is a good example of something that could be a dimension table. For my challenge I decided to use virtual machine as my dimension.
The problem is, what if a VM’s attributes change? 4 cores, that was yesterday.. today PRDSQLX has 24 cores. What if someone deletes a VM, how many cores did it have?
I can get the current status of my VMs by using the source system, but the problem is the history. I can pull a snapshot of what VMs I have in my environment every day from the source system. I could just make copies of that data and slap a “PollDate” column on the table. Viola, I have everything I need, and about 1000x more than I need.
There is the problem, how do I collect and save a history of my VM’s attributes?
Each column in my VM table can be of 3 basic types http://en.wikipedia.org/wiki/Dimension_table
Type 1. Simply overwrite this value… it changes a lot and I don’t care about history (eg. what host is the VM running on)
Type 2. add a new row to maintain history… if one column in my VM row changes, I get a whole new record in my dimension
Type 3. add a new column to keep a limited amount of history… add some columns like previous_num_cpus and previous_previous_num_cpus and move data to that as it changes
So we have to take the data we get on a nightly snapshot of the source, and compare it to what we have in the destination, then do a conditional split. I’m sticking to handling these differences:
New VM – insert with NULL validto (easy)
Deleted VM – change validto column (create staging table and do an except query)
Change in Type 1 Col – update existing VM row with NULL validto column, (easy)
Change in Type 2 Col – insert new row with NULL validto column, change previous record’s validto date (a little tricky)
That logical split can be made easier by using the Slowly Changing Dimension task in SSDT. It pops up a wizard to help you along the way and completely set you up for several failures which I am going to let you learn on your own :]
Step 1. Setup an initial loading package.
This will make it handy to restart your development.
Query the source in a data flow OLE DB Source
Tack on a few extra columns, validfrom, validto, isdeleted, sourcesystemid in the SQL command
create the destination table using the new button ( this is pretty handy to avoid manually lining up all datatypes )
use the new button again to create a dimVM_staging table for later
Add the task at the beginning of the control flow to truncate destination or dimVM table
Run the package and be careful not to accidentally run it since it has a truncate
Step 2. Create this monstrosity
It is actually not too terribly bad. When you add the Slowly Changing Dimension a wizard pops up and when all the stars align, all the data flow transformations and destination below are created.
If we focus on the top of the data flow first, it is easy to see I am pulling from two source systems and doing a union all. The interesting problem I had to solve was the deleted VM problem. The wizard didn’t do that for me. I knew if I had the staging table, I could compare that to the dimVM to see if anything was missing. If you want to find out what is missing, use an EXCEPT query. Once you find out what is missing (deleted VMs) we can update the validto field effectively closing up shop on that row but keeping the history of rows relating to that VM. I decided to add the isdeleted column to make it easier to find deleted VMs. This code is in the SQL Script task on the control flow.
update dimVM set dimVM.validto = getdate(), dimVM.isdeleted = 1 from dimVM inner join ( select vmid,vcenter from dimVM where validto is null except select vmid,vcenter from dimVM_staging ) del on dimVM.vmid = del.vmid and dimVM.vcenter = del.vcenter
One last little tidbit. If you make any modifications to the transformations that the SCD wizard created, you should document them with an annotation. If for some reason you have to get back into the wizard, it will recreate those transformations from scratch… ironically not maintaining any history.
Step 3. Profit
I hope you enjoyed hearing about my new experiences in the Slowly Changing Dimension transformation in SSDT.