HowTo: Create a date dimension with PDI (Kettle)
Alright... I've shown you how to build a PDI executable from SVN. Now we should use this, shall we?
Why create a date dimension?
This question was discussed many times and Google will get you an insight. I had to discuss this myself at more than one occasion both in my projects and at conferences like TDWI's European Conference at Amsterdam.
To cut this short: I think that a date dimension can offer you much more value than the database or your reporting tool can offer. Both know nothing about the sales seasons your business follows. Neither of them know about the fiscal year cycle you have to follow. And I doubt that many tools know when Easter Sunday is.
All this and almost anything else you can think of will be handled by a date dimension.
How does a date dimension look like?
Like any good dimension our date dimension will have a key (ID) and lots and lots of attributes. Each row in the dimension will represent exactly one day. We will add columns for different formats of a date - this is not necessary for many of you as your tools know how to format a date, but it will make it easier to enforce a standard upon all reports from all systems. Plus, some tools are not aware of special stuff like a week's year - yeah, weeks do have their own year.
Also we want to add stuff that SQL can't tell us. In this example I'll show you how to add the Christian holidays like Easter which have no fixed date. You can easily add your own flavour of holiday or season.
ID? I heard of surrogate keys and smart keys...
Yeah... another old discussion. We'll go with smart keys as they do come in handy. And no, we'll not use the date itself as a key. Too tempting and too fragile if you add in stuff like 'n/a', 'unknown' or 'malformed in the input'.
The set up
Okay... you'll need PDI for this. Go to Pentaho and download it. Or build your very own version like I told you before.
At the end of the article I'll link the transformation - if you're lazy and know your way around data warehousing and Pentaho's kettle tool, skip ahead and miss all the fun.
Still reading? Good - I'll walk you through this and explain stuff.
Overview
This is how the transformation looks like in Spoon. There are basically two streams - on the left side single days are flying by and they are enhanced with more and more information about the date they represent.
On the right hand side we have single years and we're adding information about special dates (like Easter Sunday) to each of the years.
Generating days
We use a "Generate rows" step to do this. All we need is a date field that contains the date of the very first day of our date dimension. Make sure you're looking far enough in the past. Think of people's birthday in your CRM system, think of company's create date in your customer database, think of "When was the last time I was home early?" - see... you'll need to go far into the past for this to work!
Remember to set the format - that's important.
We'll add a sequence to our date. It will start with the number 0 and we'll add one for each row that passes this step. This will allow us to have different dates than just the first of January 1900 we've set in the "Generate rows" step.
More attributes, please!
As I wrote above we'll add some nice formatting and we start with 'month in year', 'ISO week number' and stuff like that. PDI offers the "Calculator" step for many date based calculations and we're using it now.
The most important column we're adding to our date row is the one called "iso_year_of_week' - calculating the year a week falls into is not trivial and you'll need it most likely.
I didn't set the type for these columns - it's safer to set it but I was lazy.
The next step is to add our date-type date in different formats to our row. This is a heavily-debated topic. I wrote some words about this above and I'll stick with: I think it's a good idea.
So what we do is: we add a "Select values" step and use it's 'Metadata' tab to change the format of our columns. We're using Java-syntax for this - and be aware this whole thing it locale aware. So while I'm getting "Januar" you might get "January" or "Janvier" or something.
This is where I tricked you! Have you noticed that I'm changing the format of columns that were never created? Well... I created them on-the-fly in the 'Select & Alter' tab of this step. You can specify a source column multiple times if you remember to rename it. Now you get copies of that column and you can alter their meta-data on the right-most tab.
Do I really need to work!?
Now this is the nicest part of the date dimension - weekends! I love them and therefore they must be included. What I'm using is probably not the most straight forward way but it came to my mind.
So what we're doing is this: we create two rows that contain information about weekends - one row for Saturday and one for Sunday. They need some kind of key so we can join the information with what we have and all the attributes we like about weekends (I've only put is_weekend there).
I did this with two separate "Generate rows" steps that get pushed in a "Dummy" step - this adds two streams in a random way (there is also a step to add them one after the other).
Now we need to join these two rows with the hundreds and thousands of lines we're getting from our initial "Generate rows" step that produced all these single day rows.
We're using a "Stream Lookup" step for this as it's perfect for this. It stores the two rows coming from the step indicated in the 'Lookup step' dropdown box. We need to specify the join key (day_of_week is the same as day_in_week in our example) and we can cherry-pick the attributes we'd like on our thousands of rows. Just add them where I have the 'is_weekend' column.
Can I have a look?
We've been doing a lot so far. We created all those single days, added attributes with the "Calculator" step, added the date in different formats and we've put the weekend flag in there. It's time to have a look, ey?
Right-Click the last step (Stream Lookup in this example) and choose to preview. It should look like this:
Eastern - the base for many holy days
Some dates are not based on their numeric value but have religious roots. In the Christian religion they are based on the date of the Easter Sunday. In order to have a nice date dimension we will add this special day and some of the days we can derive from it.
They way we're doing this should work for other calendars as well (Orthodox Christians have their own way to calculate theses days, other religions like Islam have a completely different set of calandaric calculations).
First of all we create rows representing single years now. We then want to add the date of Easter Sunday and it's derivatives. We also use these single years to add some fixed holidays in there as well - like New Year's Eve.
Have a look at the overview:
We're doing the same thing as before - create rows with a base date, add a sequence to it creating single years and then the fancy stuff.
Calculating Easter Sunday
We're using a well known formula to calculate the date of Easter Sunday. As performance is not of the essence in this one-time executed transformation we're using the "Java Script" step which is generally slow but allows you to do a lot of stuff. Mind you: if you find a combination of other steps it will most likely be faster to use these instead of the JavaScript.
Now we have single year rows that have the date of Easter Sunday as an attribute.
Adding the derivatives
Again we're using the "Calculator" step. We create temporary columns with the number of days a derivative is 'off' Easter Sunday and use these to calculate the proper date in the year we're handling.
Make it joinable
We're getting to a difficult step now - how do we join our single-year-rows with our single-day-rows based on a number of columns that contain the holiday's date?
Well... we don't. What we need to do now is transform each row into multiple rows of which each represents one day (one of our holidays). Thus each year-row is split into many day-rows that can be easily joined.
For this we use the "Row Normalizer" step:
The trick is to give the same 'new field' to all of our dates - thus we create just one field with many different values. The 'Type' column will help us identify our holiday later on.
Now let's have a look at before and after:
Again we use the "Stream Lookup" to join our many-single-days stream with our few-special-days stream.
Ordinary holidays
As mentioned before there are also ordinary holidays that have a fixed day. Like New Year's Eve on the 31st of December. We also create those.
To do this we copy all our single-year-rows before we calculate Easter Sunday and put these lines into a "Calculator" step.
Again we add temporary columns, this time strings having the day and month part of our date. Adding these to our year column we get a string representing the proper date. We just have to set a format (yyyy-MM-dd) and we can specify the Date type of the columns.
Do I still have to work!?
Now... with the weekend flag we have a nice indicator telling us that we're facing a day off or two - with the religious holidays and our ordinary holidays added we can also add a flag 'holiday' to our columns. I decided to add some more logic to it. Each day can be a moveable holiday, a fixed one, both or nothing of this.
Again we're using the "Java Script" step to calculate this logic:
One last look at it...
So this is what we're getting. A stream of single days with lots of indicators, lots of formats and attributes. The stream knows about movable and fixed holidays.
You can now start adding stuff specific to your use case - add fiscal dates, add sales seasons, add important sports events if you're selling beer. You should now have the tools at hand and a rough idea on how to use them.
Please leave me a note in the comments if you find worthwhile attributes you'd like to share or that you're having a hard time implementing - maybe I can help.
The transformation
All the lazy people: welcome back, you missed a lot!
All the others: You earned it. Get your copy now.
Example transformation to create a date dimension