Date Islands and Data Warehousing
Data warehouse design has a big component about exposing the data to users in a way that is both logical and easy to use. After all, no one wants to go into a warehouse and spend hours trying to look for the Ikea flat pack that is out of stock!
This blog post explores of the problems that date ranges and how to get around them using some Kimball-like thinking.
All about those dates
The classic example of dealing with date ranges is if you have multiple dimension tables with different date ranges and you need to join them up before joining with a fact table.
For example, you have two tables are per below, sourced from a CRM system:
Customer Contract Dimension table
|Customer ID||Contract GUID||Contract Start Date||Contract End Date|
As above, the table has date ranges for contracts, where Customer ID is foreign key to the above table, with a surrogate key to uniquely identify each row.
Customer Dimension table
|Status ID||Customer ID||Name||Status||Valid From||Valid To||Current Flag|
The Customer Dimension table uses a Kimball-style Slowly Changing Dimension (Type 2).
Purchase Orders Fact table
|Contract ID||Product||Purchase Date|
|12b4-39A7||Flat Pack Shelf||2011-12-17|
This fact table has all the purchases made, where the Contract ID is the primary key and part of the composite key in the Customer Contract table above.
Now, what if you wanted to query what the status of a customer was when the purchase order was made?
So many dates! I need an anchor!
With a date range problem, it is easier to visualise the ranges.
If we took the contract date ranges of one contract (12b4-39A7) and visualised it against the customer status, you would get this:
From this visualisation, you can see that if you want to check whether a status has at least a partial date range within the contract, it is a matter of comparing:
- Red Line -> Contract From Date with Status To Date
- Green Line -> Contract To Date with Status From Date
From there, it’s a simple comparison!
- If the Status To Date is AFTER the Contract From Date, it’s covered!
- If the Status From Date is BEFORE the Contract To Date, it’s covered!
A new dimension!
Now that we have the logic, we need to build a new dimension that has the appropriate surrogate key which brings both tables together! Using Kimball best practices, this will be an incrementing integer!
This means if you want to see a customer contract and status for a purchase date, you only need to do one join.
Customer Contract Status table
|SK||Status ID||Contract GUID||From Date||To Date|
As bolded in the above new dimension table, now there’s one date range we can reference - if you want to see the periods for the specific customer status and contracts, you just need to use the keys to join back to the above tables!
No need to do anymore date range joins! Also without the customer ID as a key, you don’t need to worry about accidental duplicates resulting from joining customer ID to multiple statuses/contracts.
For example, if customer ID 1 was used to join status and contracts, you would get at least 3 rows - potentially even 6 (2 rows in customer contract x 3 rows in customer status).
Bringing it into SQL!
Using the above logic, to get a result with all three tables joined, you would do:
Navigating the date ranges is not too difficult when you have an anchor! In my case, my go-to is always that visualisation!