The gap is as important as the island
Identifying the gaps and how long a continuous date range is an important exercise a few industries/domains. For example, in the insurance industry, identifying whether a person has a policy covered continuously (and if not) for how long is important for calculating numerous things - e.g. eligibility, waiting periods, etc
But first a recap on what I wrote before!
Date ranges - a trip down memory lane!
In my previous blog post, I wrote about how to fine-tune a list of date ranges in a Kimball dimension table to get even more insightful date ranges.
In particular, the focus was sub-dividing the date ranges to get unique rows that can easily be used to join multiple fact tables (e.g. contract date ranges with customer registration date ranges).
This blog entry will focus on identifying the gaps between continuous date ranges instead - the ‘ocean’ between the islands!
Insurance cover example
The dataset sample relates to a simple car insurance scenario.
Car Registration Table
This table has the car registration details. Primary key is the Registration globally unique identifier (GUID)
RegistrationGUID | RegNo | OwnerName | FromDate | ToDate |
---|---|---|---|---|
1abc-djfj-1111 | 123ABC | John Smith | 2018-03-12 | 2019-03-11 |
1abc-qpqo-1121 | 123ABC | John Smith | 2019-03-12 | 2019-09-12 |
09qa-tiyo-6892 | 123ABC | Johanna More | 2019-09-13 | 2020-09-12 |
1ddd-zmbn-4019 | 333YUZ | Sam Lee | 2018-10-11 | 2019-04-01 |
10aa-bmbz-5910 | 333YUZ | Bobby Bob | 2019-04-02 | 2019-11-11 |
aa67-yuia-7280 | 090UIU | John Smith | 2018-06-22 | 2019-02-01 |
Car Insurance Policy Table
This table has the car insurance policy details. Primary key is the Policy Number.
PolicyNo | RegistrationGUID | FromDate | ToDate |
---|---|---|---|
POL10101-A | 1abc-djfj-1111 | 2018-03-12 | 2019-03-11 |
POL12191-C | 1abc-qpqo-1121 | 2019-03-12 | 2019-09-12 |
POL13019-E | 09qa-tiyo-6892 | 2019-10-31 | 2020-01-01 |
POL20190-L | 10aa-bmbz-5910 | 2019-01-04 | 2019-11-11 |
POL21101-E | 1ddd-zmbn-4019 | 2020-01-03 | 2020-09-19 |
POL28705-C | aa67-yuia-7280 | 2018-09-22 | 2019-02-01 |
POL28491-A | 1ddd-zmbn-4019 | 2018-10-11 | 2019-07-13 |
Now, using the above data, you want to know the following:
What makes date island problems challenging is conventional grouping and aggregation methods don't work
For example, if you just got the MIN start date and MAX end date where there was cover, you would not identify all the 'gaps' in between these dates. You would just end up with this:
RegistrationNo | PolicyNo | StartDate | EndDate | Cover_YN | Days |
---|---|---|---|---|---|
090UIU | None | 2018-06-22 | 2018-09-21 | N | 92 |
090UIU | POL28705-C | 2018-09-22 | 2019-02-01 | Y | 133 |
123ABC | POL13019-E | 2018-03-12 | 2020-01-01 | Y | 613 |
123ABC | None | 2019-09-13 | 2020-09-12 | N | 303 |
333YUZ | POL20190-L | 2018-10-11 | 2019-11-11 | Y | 397 |
Wait hold on? Where are all the gaps? It's just getting the date boundaries without showing the gaps!
Looks like we need a better solution! Here comes in dimension dates and dense ranking!
Dimension dates and dense ranking to the rescue!
For the purposes of this blog, I've already created a dimension date (basically just a continuous list of dates from 2018 to 2020.
If you want to see more details, there is a link to the colab notebook at the bottom with more details.
Basically the goal is to get, for everyday in the date range, what the registration and policy status of every car registration is - ie covered by policy, registered?
That will generate a table that looks something like below:
FullDate | RegistrationGUID | RegistrationNo | CarRegistrationFromDate | CarRegistrationToDate | PolicyNo | PolicyStartDate |
---|---|---|---|---|---|---|
2018-03-12 | 1abc-djfj-1111 | 123ABC | 2018-03-12 | 2019-03-11 | POL10101-A | 2018-03-12 |
2018-03-13 | 1abc-djfj-1111 | 123ABC | 2018-03-12 | 2019-03-11 | POL10101-A | 2018-03-12 |
2018-03-14 | 1abc-djfj-1111 | 123ABC | 2018-03-12 | 2019-03-11 | POL10101-A | 2018-03-12 |
2018-03-15 | 1abc-djfj-1111 | 123ABC | 2018-03-12 | 2019-03-11 | POL10101-A | 2018-03-12 |
All the date ranges of car registration and insurance policies have been 'stretched out' to one row per day in the date range. In this case, the date range is book-ended to only the car registrations (as if the car isn't registered, let's assume we don't need to worry about it).
For reference, if you wanted to create a new dimension table with the date ranges (similiar to what I did in my other previous blog entry, below is how you would do it.
The result will look something like this - with a new dimension surrogate key:
Status_SK | StartDate | EndDate | RegistrationGUID | RegistrationNo | CarRegistrationFromDate | CarRegistrationToDate | PolicyNo | PolicyStartDate | PolicyEndDate |
---|---|---|---|---|---|---|---|---|---|
1abc-djfj-1111_POL10101-A_2018-03-12 | 2018-03-12 | 2019-03-11 | 1abc-djfj-1111 | 123ABC | 2018-03-12 | 2019-03-11 | POL10101-A | 2018-03-12 | 2019-03-11 |
1abc-qpqo-1121_POL12191-C_2019-03-12 | 2019-03-12 | 2019-09-12 | 1abc-qpqo-1121 | 123ABC | 2019-03-12 | 2019-09-12 | POL12191-C | 2019-03-12 | 2019-09-12 |
09qa-tiyo-6892_POL13019-E_2019-10-31 | 2019-10-31 | 2020-01-01 | 09qa-tiyo-6892 | 123ABC | 2019-09-13 | 2020-09-12 | POL13019-E | 2019-10-31 | 2020-01-01 |
1ddd-zmbn-4019_POL28491-A_2018-10-11 | 2018-10-11 | 2019-04-01 | 1ddd-zmbn-4019 | 333YUZ | 2018-10-11 | 2019-04-01 | POL28491-A | 2018-10-11 | 2019-07-13 |
10aa-bmbz-5910_POL20190-L_2019-04-02 | 2019-04-02 | 2019-11-11 | 10aa-bmbz-5910 | 333YUZ | 2019-04-02 | 2019-11-11 | POL20190-L | 2019-01-04 | 2019-11-11 |
aa67-yuia-7280_POL28705-C_2018-09-22 | 2018-09-22 | 2019-02-01 | aa67-yuia-7280 | 090UIU | 2018-06-22 | 2019-02-01 | POL28705-C | 2018-09-22 | 2019-02-01 |
Back to Daily Table!
After our little side track, let's get back to our daily table. Even though we now can see which days are covered, we still need to group them together.
This is where Dense Rank comes in - essentially you use it to determine how 'far' a date is away from the start date of its range.
That is, you partition the data by registration number and whether its covered by a policy,so the 'rank' in a particular partition - ie how far it is away from the start date.
That way, if you minus the DENSE_RANK against the Date, you will get the number of days it is from the partition start date.
This is what you get:
FullDate | RegistrationGUID | RegistrationNo | PolicyNo | Cover_YN | DateRank | PartitionStartDate |
---|---|---|---|---|---|---|
2018-06-22 | aa67-yuia-7280 | 090UIU | None | N | 1 | 2018-06-22 |
2018-06-23 | aa67-yuia-7280 | 090UIU | None | N | 2 | 2018-06-22 |
2018-06-24 | aa67-yuia-7280 | 090UIU | None | N | 3 | 2018-06-22 |
2018-06-25 | aa67-yuia-7280 | 090UIU | None | N | 4 | 2018-06-22 |
2018-06-26 | aa67-yuia-7280 | 090UIU | None | N | 5 | 2018-06-22 |
... | ... | ... | ... | ... | ... | ... |
2019-09-13 | 09qa-tiyo-6892 | 123ABC | None | N | 1 | 2019-09-13 |
2019-09-14 | 09qa-tiyo-6892 | 123ABC | None | N | 2 | 2019-09-13 |
2019-09-15 | 09qa-tiyo-6892 | 123ABC | None | N | 3 | 2019-09-13 |
Almost there! Now just minus the dense rank and group the data!
Now all you got to do is minus the dense rank (as per the SQL above), and just group it together!
The final result with the answer!
RegistrationGUID | RegistrationNo | PolicyNo | StartDate | EndDate | Cover_YN | Days |
---|---|---|---|---|---|---|
aa67-yuia-7280 | 090UIU | None | 2018-06-22 | 2018-09-21 | N | 92 |
aa67-yuia-7280 | 090UIU | POL28705-C | 2018-09-22 | 2019-02-01 | Y | 133 |
1abc-qpqo-1121 | 123ABC | POL12191-C | 2018-03-12 | 2019-09-12 | Y | 550 |
09qa-tiyo-6892 | 123ABC | None | 2019-09-13 | 2019-10-30 | N | 48 |
09qa-tiyo-6892 | 123ABC | POL13019-E | 2019-10-31 | 2020-01-01 | Y | 63 |
09qa-tiyo-6892 | 123ABC | None | 2020-01-02 | 2020-09-12 | N | 255 |
10aa-bmbz-5910 | 333YUZ | POL20190-L | 2018-10-11 | 2019-11-11 | Y | 397 |
Nice! We can see the coverage periods between ownership and cover!
You can see that John Smith had continuous cover for his 123ABC car for 550 days, then there was a gap of 48 days and 63 days.
Visualised using a Gantt chart
Another good day to visualise this dataset is using a Gantt chart - if you do it right, you'll easily see where the gap is!
Closing Remarks
We did it! We got through and found the gaps!
If you want to fiddle with the data and write some SQL, here is a link to my Colab Notebook.