Navigating the date islands and finding the gaps - Part 2

Mind the Gap!

Posted by Albert Cheng on 11 March 2020

Last Updated: 18 May 2020

The gap is as important as the island

Identifying the gaps and how long a continuous date range is an important exercise a few industries/domains. For example, in the insurance industry, identifying whether a person has a policy covered continuously (and if not) for how long is important for calculating numerous things - e.g. eligibility, waiting periods, etc

Insurance cover is filled with continuous period requirements! Photo by Pixabay from Pexels

But first a recap on what I wrote before!

Date ranges - a trip down memory lane!

In my previous blog post, I wrote about how to fine-tune a list of date ranges in a Kimball dimension table to get even more insightful date ranges.

In particular, the focus was sub-dividing the date ranges to get unique rows that can easily be used to join multiple fact tables (e.g. contract date ranges with customer registration date ranges).

This blog entry will focus on identifying the gaps between continuous date ranges instead - the ‘ocean’ between the islands!

Insurance cover example

The dataset sample relates to a simple car insurance scenario.

Car Registration Table

This table has the car registration details. Primary key is the Registration globally unique identifier (GUID)

RegistrationGUID RegNo OwnerName FromDate ToDate
1abc-djfj-1111 123ABC John Smith 2018-03-12 2019-03-11
1abc-qpqo-1121 123ABC John Smith 2019-03-12 2019-09-12
09qa-tiyo-6892 123ABC Johanna More 2019-09-13 2020-09-12
1ddd-zmbn-4019 333YUZ Sam Lee 2018-10-11 2019-04-01
10aa-bmbz-5910 333YUZ Bobby Bob 2019-04-02 2019-11-11
aa67-yuia-7280 090UIU John Smith 2018-06-22 2019-02-01

Car Insurance Policy Table

This table has the car insurance policy details. Primary key is the Policy Number.

PolicyNo RegistrationGUID FromDate ToDate
POL10101-A 1abc-djfj-1111 2018-03-12 2019-03-11
POL12191-C 1abc-qpqo-1121 2019-03-12 2019-09-12
POL13019-E 09qa-tiyo-6892 2019-10-31 2020-01-01
POL20190-L 10aa-bmbz-5910 2019-01-04 2019-11-11
POL21101-E 1ddd-zmbn-4019 2020-01-03 2020-09-19
POL28705-C aa67-yuia-7280 2018-09-22 2019-02-01
POL28491-A 1ddd-zmbn-4019 2018-10-11 2019-07-13

Now, using the above data, you want to know the following:

  • Identifying the number of continuous days Joe Smith was covered by a car insurance for both his vehicles.
  • Grouping the continuous days together and showing the 'gaps' between continuous cover
  • What makes date island problems challenging is conventional grouping and aggregation methods don't work

    For example, if you just got the MIN start date and MAX end date where there was cover, you would not identify all the 'gaps' in between these dates. You would just end up with this:

    RegistrationNo PolicyNo StartDate EndDate Cover_YN Days
    090UIU None 2018-06-22 2018-09-21 N 92
    090UIU POL28705-C 2018-09-22 2019-02-01 Y 133
    123ABC POL13019-E 2018-03-12 2020-01-01 Y 613
    123ABC None 2019-09-13 2020-09-12 N 303
    333YUZ POL20190-L 2018-10-11 2019-11-11 Y 397

    Wait hold on? Where are all the gaps? It's just getting the date boundaries without showing the gaps!

    Looks like we need a better solution! Here comes in dimension dates and dense ranking!

    Too many date ranges, we need an anchor! Photo by ABD NIMIT from Pexels

    Dimension dates and dense ranking to the rescue!

    For the purposes of this blog, I've already created a dimension date (basically just a continuous list of dates from 2018 to 2020.

    If you want to see more details, there is a link to the colab notebook at the bottom with more details.

    Basically the goal is to get, for everyday in the date range, what the registration and policy status of every car registration is - ie covered by policy, registered?

    That will generate a table that looks something like below:

    FullDate RegistrationGUID RegistrationNo CarRegistrationFromDate CarRegistrationToDate PolicyNo PolicyStartDate
    2018-03-12 1abc-djfj-1111 123ABC 2018-03-12 2019-03-11 POL10101-A 2018-03-12
    2018-03-13 1abc-djfj-1111 123ABC 2018-03-12 2019-03-11 POL10101-A 2018-03-12
    2018-03-14 1abc-djfj-1111 123ABC 2018-03-12 2019-03-11 POL10101-A 2018-03-12
    2018-03-15 1abc-djfj-1111 123ABC 2018-03-12 2019-03-11 POL10101-A 2018-03-12

    All the date ranges of car registration and insurance policies have been 'stretched out' to one row per day in the date range. In this case, the date range is book-ended to only the car registrations (as if the car isn't registered, let's assume we don't need to worry about it).

    For reference, if you wanted to create a new dimension table with the date ranges (similiar to what I did in my other previous blog entry, below is how you would do it.

    The result will look something like this - with a new dimension surrogate key:

    Status_SK StartDate EndDate RegistrationGUID RegistrationNo CarRegistrationFromDate CarRegistrationToDate PolicyNo PolicyStartDate PolicyEndDate
    1abc-djfj-1111_POL10101-A_2018-03-12 2018-03-12 2019-03-11 1abc-djfj-1111 123ABC 2018-03-12 2019-03-11 POL10101-A 2018-03-12 2019-03-11
    1abc-qpqo-1121_POL12191-C_2019-03-12 2019-03-12 2019-09-12 1abc-qpqo-1121 123ABC 2019-03-12 2019-09-12 POL12191-C 2019-03-12 2019-09-12
    09qa-tiyo-6892_POL13019-E_2019-10-31 2019-10-31 2020-01-01 09qa-tiyo-6892 123ABC 2019-09-13 2020-09-12 POL13019-E 2019-10-31 2020-01-01
    1ddd-zmbn-4019_POL28491-A_2018-10-11 2018-10-11 2019-04-01 1ddd-zmbn-4019 333YUZ 2018-10-11 2019-04-01 POL28491-A 2018-10-11 2019-07-13
    10aa-bmbz-5910_POL20190-L_2019-04-02 2019-04-02 2019-11-11 10aa-bmbz-5910 333YUZ 2019-04-02 2019-11-11 POL20190-L 2019-01-04 2019-11-11
    aa67-yuia-7280_POL28705-C_2018-09-22 2018-09-22 2019-02-01 aa67-yuia-7280 090UIU 2018-06-22 2019-02-01 POL28705-C 2018-09-22 2019-02-01

    Back to Daily Table!

    After our little side track, let's get back to our daily table. Even though we now can see which days are covered, we still need to group them together.

    This is where Dense Rank comes in - essentially you use it to determine how 'far' a date is away from the start date of its range.

    That is, you partition the data by registration number and whether its covered by a policy,so the 'rank' in a particular partition - ie how far it is away from the start date.

    That way, if you minus the DENSE_RANK against the Date, you will get the number of days it is from the partition start date.

    This is what you get:

    FullDate RegistrationGUID RegistrationNo PolicyNo Cover_YN DateRank PartitionStartDate
    2018-06-22 aa67-yuia-7280 090UIU None N 1 2018-06-22
    2018-06-23 aa67-yuia-7280 090UIU None N 2 2018-06-22
    2018-06-24 aa67-yuia-7280 090UIU None N 3 2018-06-22
    2018-06-25 aa67-yuia-7280 090UIU None N 4 2018-06-22
    2018-06-26 aa67-yuia-7280 090UIU None N 5 2018-06-22
    ... ... ... ... ... ... ...
    2019-09-13 09qa-tiyo-6892 123ABC None N 1 2019-09-13
    2019-09-14 09qa-tiyo-6892 123ABC None N 2 2019-09-13
    2019-09-15 09qa-tiyo-6892 123ABC None N 3 2019-09-13

    Almost there! Now just minus the dense rank and group the data!

    Now all you got to do is minus the dense rank (as per the SQL above), and just group it together!

    The final result with the answer!

    RegistrationGUID RegistrationNo PolicyNo StartDate EndDate Cover_YN Days
    aa67-yuia-7280 090UIU None 2018-06-22 2018-09-21 N 92
    aa67-yuia-7280 090UIU POL28705-C 2018-09-22 2019-02-01 Y 133
    1abc-qpqo-1121 123ABC POL12191-C 2018-03-12 2019-09-12 Y 550
    09qa-tiyo-6892 123ABC None 2019-09-13 2019-10-30 N 48
    09qa-tiyo-6892 123ABC POL13019-E 2019-10-31 2020-01-01 Y 63
    09qa-tiyo-6892 123ABC None 2020-01-02 2020-09-12 N 255
    10aa-bmbz-5910 333YUZ POL20190-L 2018-10-11 2019-11-11 Y 397

    Nice! We can see the coverage periods between ownership and cover!

    You can see that John Smith had continuous cover for his 123ABC car for 550 days, then there was a gap of 48 days and 63 days.

    Visualised using a Gantt chart

    Another good day to visualise this dataset is using a Gantt chart - if you do it right, you'll easily see where the gap is!

    Closing Remarks

    We navigated the gaps in the islands! Photo by Pixabay from Pexels

    We did it! We got through and found the gaps!

    If you want to fiddle with the data and write some SQL, here is a link to my Colab Notebook.