Thursday, May 19, 2011

Tip of the day: Reload your static or semistatic data continuously.

Let's say you have a table of dates and properties of those dates (like a time dimension in a star-schema in a DW). Typically someone fills it with all the dates up to some date that seems far off enough in the future, thinking that he'll never have to worry about it again.

Ok, so maybe there is some sort of script that was used to initially load the data. And when a change is necessary, such as when you need to add more dates or change the fiscal calendar, you can patch the script and reapply it, right?

Then, after a few rounds of patching by various developers, the data starts to look slightly inconsistent, and it's no longer apparent where the different fields come from. Maybe someone patched the data without commiting his script changes to source control.

The only way to avoid the risk of someone patching the static data by hand is to continuously reload it from a script. And by continuously I mean at least daily. That way, if you need to change the data, you have to change the script. And you don't have to wonder which script was used to load the data, because you can be 100% certain that it was the script that ran last night. And every night before that.

From a technical standpoint, you typically cannot delete the data and then reinsert it, as you'll have referential integrity contraints that would break, and it wouldn't be a good idea anyway to risk getting different PKs. What I do instead is I first insert any missing rows (with null or default values for non-pk columns), and then do an update on the whole table, setting not just the columns for the new rows, but for every row in the table.

If you cannot reload your data, it is at risk of being corrupted. And the only way to know that you can, is to do.