I have a confession. I enjoy edge cases. More than I probably should. I take an unhealthy amount of pleasure in finding those unexpected oddities and poking holes in the assumptions and “soft rules” that business users and even developers make about their processes, applications, and most importantly, their data.
But let’s back up a minute. What exactly is an edge case? Well, I’ll quote Wikipedia’s definition at you, then I’ll put my own spin on it.
> An edge case is a problem or situation that occurs only at an extreme (maximum or minimum) operating parameter.
In my mind, there are really two types of edge cases. The results and treatment are largely similar, so it’s not a terribly important distinction, but for conversation’s sake, here they are.
- Known: while we can (and usually do) identify these cases, they are typically thought of as ultra-rare and/or “too difficult to reproduce”; thus, the plan for handling them involves tedious one-off or ad-hoc procedures, which are often completely undocumented.
- Unknown: the saying goes, “You don’t know what you don’t know” – these cases don’t even cross our minds as a possibility, either due to ignorance or because they are [supposedly] contrary to our business rules or our [assumed/remembered] application logic.
Again, the end result is the same: panic, discord, technical debt, and wasted hours of remediation. So why do we do this to ourselves? Well, one common justification tends to be “Oh, that’ll never happen!”, and we sweep it under the rug. Then there’s the laziness, busy-ness / lack of time, pressure to deliver, gap in abilities or tool-sets, passing the buck, etc. We’re all guilty of at least two of these in any given week.
So let’s move on to the important part: What can we do about it? Largely, we can take the excuses & reasons and turn them on their heads. Take ownership, commit to learning, communicate with management, make time for planning and documentation, and once a path is laid out for remediation, actually do the work. It’s often not easy or pretty, but a little pain now beats a lot of pain later.
I know, easier said than done, right? :o)
Let’s say our sales offices are closed on Sunday – this is our “operating assumption”. Therefore, no orders will be processed on Sundays – this is our “business rule”. So because of this, we’ve decided to do our ETL processing on Sunday mornings and produce a revenue report for the previous week. Now, we’re using some antiquated tooling, so our first batch of ETL, which takes the orders from the sales system and loads them into the bookkeeping system, runs from, say, 2am to about 5am. Then we have a second batch, which moves that bookkeeping data into the report-staging area, from 6am to about 7am. We need those hours of “buffer zones” because the ETL times are unpredictable. And finally, our reporting engine churns & burns at 8am. But along comes Overachieving Oliver, on a Sunday at 5:30am, and he’s processed a couple of orders from the other day (so perhaps he’s really Underachieving Oliver, but he’s trying to make up for it, because he enjoys alliteration almost as much as I do).
Whoa, Nelly! What happened? Oliver’s sales didn’t make it into the report! Not only that, but they don’t even exist according to bookkeeping, because the first ETL batch had already finished by the time he entered them. But come Monday, if he tries to re-process those orders, he’s probably going to get an error because they’re already in the sales system. So now he’s gotta get IT involved – probably an analyst, a developer, and maybe even a DBA. Damn, that’s a lot of resources expended because an assumption and a rule were broken!
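If it helps to see that Sunday morning spelled out, here's a minimal Python sketch. The batch times come from the story above; the helper name, the three labels, and the specific date are mine, purely for illustration. I'm also assuming that an order entered *during* the first batch may or may not get picked up, since that depends on how the ETL reads its source.

```python
from datetime import datetime, time

# Schedule from the story; "about 5am" is approximate, so treat the end as a best guess.
SALES_TO_BOOKS_START = time(2, 0)
SALES_TO_BOOKS_END = time(5, 0)

def sunday_order_fate(entered_at: datetime) -> str:
    """Classify an order entered on the Sunday of the report run (hypothetical helper)."""
    t = entered_at.time()
    if t < SALES_TO_BOOKS_START:
        return "included"   # already in the sales system when the first batch starts
    if t < SALES_TO_BOOKS_END:
        return "uncertain"  # entered mid-batch; depends on how the ETL reads its source
    return "missed"         # the first batch has already finished

oliver = datetime(2024, 1, 7, 5, 30)  # an arbitrary Sunday, 5:30am
print(sunday_order_fate(oliver))      # missed
```

Oliver lands squarely in the "missed" bucket, and nothing downstream (bookkeeping, staging, the report) ever hears about his orders.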
Here’s another one. An order consists of 1 or more order-lines, which each contain 1 or more of a given item (product). Let’s say we store these in our database in a table called `OrderLines`, and each line has a `LineNumber`. Now, we have an assumption that those `LineNumber`s are always sequential. It’s even a rule in our applications – maybe not all parts or all modules, but at least some, and enough to cause a fuss if there’s a gap in that sequence or if a line is moved around somehow without proper data dependency updates (which, by the way, are application-controlled, not database-controlled). Plus there’s some manager who depends on this old reporting metric that also breaks when those line numbers are out-of-whack. But this should never happen, right?
The operative word there being “should”. But apparently there was a bug in an “update order” routine that left a gap in the sequence. Or maybe the DBA was asked to delete something from an order post-mortem, and there’s no way within the app’s ecosystem to do it, so he had to write some queries to make it work. And guess what? Because the Dev team is super busy working on the hot new feature that everybody wants, it will be 2 weeks before they can circle back around to address that update-bug or add that utility function for line-deletion. So now the DBA’s writing a stored-proc to wrap in a scheduled job to “fix” those order-line sequences every night, to prevent that one app module from breaking and keep those management reports accurate. And, to quote my very first post ever, “the DBA waxe[s] wroth”.
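For the curious, here's roughly what that nightly "fix" might look like, sketched in Python against an in-memory SQLite database so it's easy to run. Only `OrderLines` and `LineNumber` come from the story; the `OrderID` and `Item` columns, the sample rows, and both function names are assumptions of mine. A real stored-proc would be in your database's own dialect, and it would also have to deal with those application-controlled dependencies, which this sketch ignores entirely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# OrderID and Item are assumed columns; only OrderLines/LineNumber are from the story.
conn.execute("CREATE TABLE OrderLines (OrderID INTEGER, LineNumber INTEGER, Item TEXT)")
conn.executemany(
    "INSERT INTO OrderLines VALUES (?, ?, ?)",
    [
        (1, 1, "widget"), (1, 2, "gadget"), (1, 3, "gizmo"),  # clean sequence
        (2, 1, "widget"), (2, 3, "gizmo"),                    # gap: line 2 is missing
    ],
)

def orders_with_gaps(conn):
    """Return OrderIDs whose line numbers are not exactly 1..N."""
    rows = conn.execute(
        """
        SELECT OrderID
        FROM OrderLines
        GROUP BY OrderID
        HAVING MIN(LineNumber) <> 1
            OR MAX(LineNumber) <> COUNT(*)
            OR COUNT(DISTINCT LineNumber) <> COUNT(*)
        ORDER BY OrderID
        """
    ).fetchall()
    return [row[0] for row in rows]

def resequence(conn, order_id):
    """Renumber one order's lines 1..N, keeping their current relative order."""
    rowids = [
        row[0]
        for row in conn.execute(
            "SELECT rowid FROM OrderLines WHERE OrderID = ? ORDER BY LineNumber, rowid",
            (order_id,),
        )
    ]
    with conn:  # commit on success, roll back if anything raises
        conn.executemany(
            "UPDATE OrderLines SET LineNumber = ? WHERE rowid = ?",
            [(new_number, rowid) for new_number, rowid in enumerate(rowids, start=1)],
        )

print(orders_with_gaps(conn))  # [2]
for order_id in orders_with_gaps(conn):
    resequence(conn, order_id)
print(orders_with_gaps(conn))  # []
```

Note that this only papers over the symptom. The update-bug and the missing line-deletion utility are still out there, which is exactly why the job has to run every night.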
So, prevention? Well, I’d probably start by locking down the order entry system during that off-limits window. We should also wire up those big 3 processes so that there’s no need for indeterminate buffer-zones and inconsistent timing. And yeah, it could be a big change needing lots of team buy-in and approvals, but it’s worth the investment. Right? Right!
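Here's a rough Python sketch of both ideas, just to show the shape of it. Everything in it is hypothetical: the blackout hours (I picked Sunday 2am to 9am to cover the whole run, but the right window is a business decision), the function names, and the stand-in steps. Real ETL tooling will have its own way of chaining jobs.

```python
from datetime import datetime, time

# Assumed blackout: Sunday, from the first batch's start until after the report.
BLACKOUT_START = time(2, 0)
BLACKOUT_END = time(9, 0)  # hypothetical; the story only says the report runs at 8am

def order_entry_allowed(now: datetime) -> bool:
    """Refuse order entry during the Sunday processing window."""
    is_sunday = now.weekday() == 6  # Monday is 0, Sunday is 6
    return not (is_sunday and BLACKOUT_START <= now.time() < BLACKOUT_END)

def run_pipeline(steps):
    """Run steps in order, each starting as soon as the previous one finishes."""
    completed = []
    for name, step in steps:
        step()  # an exception here stops the chain, so later steps never see bad data
        completed.append(name)
    return completed

# Stand-ins for the three real processes.
steps = [
    ("sales_to_bookkeeping", lambda: None),
    ("bookkeeping_to_staging", lambda: None),
    ("revenue_report", lambda: None),
]

print(order_entry_allowed(datetime(2024, 1, 7, 5, 30)))  # False
print(run_pipeline(steps))
```

With the steps chained, the report kicks off when staging is done rather than at a fixed 8am, so the buffer zones (and the gaps Oliver slipped through) go away.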
Hope you enjoyed reading! Until next time…
Drafted with StackEdit, finished with WordPress